Computer Vision and Pattern Recognition 127
☆ TWIST2: Scalable, Portable, and Holistic Humanoid Data Collection System
Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, C. Karen Liu
Large-scale data has driven breakthroughs in robotics, from language models
to vision-language-action models in bimanual manipulation. However, humanoid
robotics lacks equally effective data collection frameworks. Existing humanoid
teleoperation systems either use decoupled control or depend on expensive
motion capture setups. We introduce TWIST2, a portable, mocap-free humanoid
teleoperation and data collection system that preserves full whole-body control
while advancing scalability. Our system leverages PICO4U VR for obtaining
real-time whole-body human motions, with a custom 2-DoF robot neck (cost around
$250) for egocentric vision, enabling holistic human-to-humanoid control. We
demonstrate long-horizon dexterous and mobile humanoid skills, and the system
can collect 100 demonstrations in 15 minutes with a nearly 100% success rate.
Building on this pipeline, we propose a hierarchical visuomotor policy
framework that autonomously controls the full humanoid body based on egocentric
vision. Our visuomotor policy successfully demonstrates whole-body dexterous
manipulation and dynamic kicking tasks. The entire system is fully reproducible
and open-sourced at https://yanjieze.com/TWIST2 . Our collected dataset is also
open-sourced at https://twist-data.github.io .
comment: Website: https://yanjieze.com/TWIST2
☆ DenseMarks: Learning Canonical Embeddings for Human Head Images via Point Tracks
We propose DenseMarks - a new learned representation for human heads,
enabling high-quality dense correspondences of human head images. For a 2D
image of a human head, a Vision Transformer network predicts a 3D embedding for
each pixel, which corresponds to a location in a 3D canonical unit cube. In
order to train our network, we collect a dataset of pairwise point matches,
estimated by a state-of-the-art point tracker over a collection of diverse
in-the-wild talking-head videos, and guide the mapping via a contrastive loss,
encouraging matched points to have close embeddings. We further employ
multi-task learning with face landmarks and segmentation constraints, as well
as imposing spatial continuity of embeddings through latent cube features,
which results in an interpretable and queryable canonical space. The
representation can be used for finding common semantic parts, face/head
tracking, and stereo reconstruction. Due to the strong supervision, our method
is robust to pose variations and covers the entire head, including hair.
Additionally, the canonical space bottleneck ensures that the obtained
representations are consistent across diverse poses and individuals. We
demonstrate state-of-the-art results in geometry-aware point matching and
monocular head tracking with 3D Morphable Models. The code and the model
checkpoint will be made available to the public.
comment: Project page: https://diddone.github.io/densemarks/ Video:
https://youtu.be/o8DOOYFW0gI 21 pages, 13 figures, 2 tables
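As a rough illustration of the contrastive supervision described above
(encouraging tracker-matched pixels to have close embeddings), here is a
minimal PyTorch sketch using an InfoNCE-style loss; the temperature, tensor
shapes, and function name are illustrative assumptions rather than the
authors' exact objective.

    import torch
    import torch.nn.functional as F

    def point_track_contrastive_loss(emb_a, emb_b, pts_a, pts_b, temperature=0.07):
        """InfoNCE-style loss over matched pixel embeddings.
        emb_a, emb_b: (C, H, W) per-pixel embeddings of two frames.
        pts_a, pts_b: (N, 2) integer (y, x) coordinates of matched points."""
        za = emb_a[:, pts_a[:, 0], pts_a[:, 1]].t()   # (N, C) embeddings at the matches
        zb = emb_b[:, pts_b[:, 0], pts_b[:, 1]].t()
        za, zb = F.normalize(za, dim=1), F.normalize(zb, dim=1)
        # Similarity of every matched point in frame A to every point in frame B.
        logits = za @ zb.t() / temperature            # (N, N)
        targets = torch.arange(za.size(0), device=za.device)
        # True matches (the diagonal) should score higher than all mismatches.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))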
☆ PLUTO-4: Frontier Pathology Foundation Models
Harshith Padigela, Shima Nofallah, Atchuth Naveen Chilaparasetti, Ryun Han, Andrew Walker, Judy Shen, Chintan Shah, Blake Martin, Aashish Sood, Elliot Miller, Ben Glass, Andy Beck, Harsha Pokkalla, Syed Ashar Javed
Foundation models trained on large-scale pathology image corpora have
demonstrated strong transfer capabilities across diverse histopathology tasks.
Building on this progress, we introduce PLUTO-4, our next generation of
pathology foundation models that extend the Pathology-Universal Transformer
(PLUTO) to frontier scale. We share two complementary Vision Transformer
architectures in the PLUTO-4 family: a compact and efficient PLUTO-4S model
optimized for multi-scale deployment using a FlexiViT setup with 2D-RoPE
embeddings, and a frontier-scale PLUTO-4G model trained with a single patch
size to maximize representation capacity and stability. Both models are
pretrained using a self-supervised objective derived from DINOv2 on a large
multi-institutional corpus containing 551,164 WSIs from 137,144 patients across
over 50 institutions, spanning over 60 disease types and over 100 stains.
Comprehensive evaluation across public and internal benchmarks demonstrates
that PLUTO-4 achieves state-of-the-art performance on tasks requiring varying
spatial and biological context, including patch-level classification,
segmentation, and slide-level diagnosis. The compact PLUTO-4S provides
high-throughput and robust performance for practical deployment, while PLUTO-4G
establishes new performance frontiers across multiple pathology benchmarks,
including an 11% improvement in dermatopathology diagnosis. These diverse
improvements underscore PLUTO-4's potential to transform real-world
applications as a backbone for translational research and diagnostic use cases.
☆ AI-Generated Image Detection: An Empirical Study and Future Research Directions
The threats posed by AI-generated media, particularly deepfakes, now raise
significant challenges for multimedia forensics, misinformation detection, and
biometric systems, resulting in an erosion of public trust in the legal system,
a significant increase in fraud, and social engineering attacks.
Although several forensic methods have been proposed, they suffer from three
critical gaps: (i) use of non-standardized benchmarks with GAN- or
diffusion-generated images, (ii) inconsistent training protocols (e.g.,
scratch, frozen, fine-tuning), and (iii) limited evaluation metrics that fail
to capture generalization and explainability. These limitations hinder fair
comparison, obscure true robustness, and restrict deployment in
security-critical applications. This paper introduces a unified benchmarking
framework for systematic evaluation of forensic methods under controlled and
reproducible conditions. We benchmark ten SoTA forensic methods (scratch,
frozen, and fine-tuned) and seven publicly available datasets (GAN and
diffusion) to perform extensive and systematic evaluations. We evaluate
performance using multiple metrics, including accuracy, average precision,
ROC-AUC, error rate, and class-wise sensitivity. We also further analyze model
interpretability using confidence curves and Grad-CAM heatmaps. Our evaluations
demonstrate substantial variability in generalization, with certain methods
exhibiting strong in-distribution performance but degraded cross-model
transferability. This study aims to guide the research community toward a
deeper understanding of the strengths and limitations of current forensic
approaches, and to inspire the development of more robust, generalizable, and
explainable solutions.
☆ When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought
Yiyang Zhou, Haoqin Tu, Zijun Wang, Zeyu Wang, Niklas Muennighoff, Fan Nie, Yejin Choi, James Zou, Chaorui Deng, Shen Yan, Haoqi Fan, Cihang Xie, Huaxiu Yao, Qinghao Ye
We propose MIRA, a new benchmark designed to evaluate models in scenarios
where generating intermediate visual images is essential for successful
reasoning. Unlike traditional CoT methods that rely solely on text, tasks in
MIRA require models to generate and utilize intermediate images - such as
sketches, structural diagrams, or path drawings - to guide their reasoning
process. This setup closely mirrors how humans solve complex problems through
"drawing to think". To solve this, MIRA focuses on tasks that are intrinsically
challenging and involve complex structures, spatial relationships, or reasoning
steps that are difficult to express through language alone. To ensure that our
evaluation data is of high-quality, we include 546 multimodal problems,
annotated with intermediate visual images and final answers. We also propose a
unified evaluation protocol for MIRA that spans three levels of evaluation
input: direct input with image and question only, text-only CoT input with
image and thinking prompts, and Visual-CoT input with both annotated image
clues and textual thinking prompts. To probe the upper bound of model capacity
on our benchmark, we also report pass@k and majority voting accuracies under
different k settings. Experimental results show that existing multimodal large
language models, including the strongest proprietary models as well as strong
open-weight models, perform poorly when relying solely on textual prompts.
However, when intermediate visual cues are provided, model performance improves
consistently, yielding an average relative gain of 33.7% across all models and
tasks. We also probe the upper bound by expanding the search space and
designing textual prompts aligned with Visual-CoT, but both yield only limited
improvements compared to our Visual-CoT setting. These results underscore the
critical role of imagined visual information in enabling successful reasoning
on MIRA.
comment: 28 pages, 15 figures
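For reference, the sketch below shows the standard unbiased pass@k estimator
and a simple majority-voting accuracy over k sampled answers; the function
names and example numbers are illustrative, and the benchmark's exact protocol
may differ.

    from collections import Counter
    from math import comb

    def pass_at_k(n, c, k):
        """Unbiased pass@k: with n samples of which c are correct, the probability
        that at least one of k drawn samples is correct."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    def majority_vote_accuracy(samples, reference):
        """samples: list of k model answers for one problem."""
        voted, _ = Counter(samples).most_common(1)[0]
        return float(voted == reference)

    # Example: 16 samples, 5 of them correct, evaluated at k = 4.
    print(round(pass_at_k(16, 5, 4), 3))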
☆ VCode: a Multimodal Coding Benchmark with SVG as Symbolic Visual Representation
Kevin Qinghong Lin, Yuhao Zheng, Hangyu Ran, Dantong Zhu, Dongxing Mao, Linjie Li, Philip Torr, Alex Jinpeng Wang
Code has emerged as a precise and executable medium for reasoning and action
in the agent era. Yet, progress has largely focused on language-centric tasks
such as program synthesis and debugging, leaving visual-centric coding
underexplored. Inspired by how humans reason over sketches, we advocate SVG
code as a compact, interpretable, and executable visual representation. We
introduce VCode, a benchmark that reframes multimodal understanding as code
generation: given an image, a model must produce SVG that preserves symbolic
meaning for downstream reasoning. VCode covers three domains - general
commonsense (MM-Vet), professional disciplines (MMMU), and visual-centric
perception (CV-Bench). To assess symbolic fidelity, we propose CodeVQA, a novel
evaluation protocol in which a policy model answers questions over rendered
SVGs; correct answers indicate faithful symbolic preservation. Empirically,
frontier VLMs struggle to generate faithful SVGs, revealing a persistent gap
between language-centric and visual-centric coding. To close this gap, we
introduce VCoder, an agentic framework that augments VLMs along two axes: (i)
Thinking with Revision, which iteratively analyzes discrepancies and refines
SVG code; and (ii) Acting with Visual Tools, where detectors and parsers supply
structured cues such as objects, shapes, and text beyond the model's intrinsic
capacity. Across benchmarks, frontier VLMs with strong reasoning capabilities
score well overall yet remain limited in professional knowledge and 3D
reasoning. VCoder delivers a 12.3-point overall gain over the top-performing
Claude-4-Opus. Human studies show that both humans and VLMs perform worse on
rendered SVGs, yet their consistency reveals the promise of symbolic visual
representation. The benchmark and code are available at
https://github.com/CSU-JPG/VCode.
comment: Project page: https://csu-jpg.github.io/VCode Github:
https://github.com/CSU-JPG/VCode
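The CodeVQA protocol can be pictured as the loop below: the predicted SVG is
rasterized and a policy model answers the original question over the rendering,
so a correct answer indicates that the SVG preserved the needed symbolic
content. cairosvg is just one possible renderer, and policy_model_answer is a
hypothetical stand-in for whatever VLM acts as the policy.

    import cairosvg

    def policy_model_answer(image_path, question):
        """Hypothetical wrapper around the policy VLM; replace with a real client."""
        raise NotImplementedError

    def codevqa_score(svg_code, question, gold_answer, out_path="render.png"):
        # Rasterize the generated SVG so the policy model sees ordinary pixels.
        cairosvg.svg2png(bytestring=svg_code.encode("utf-8"), write_to=out_path)
        prediction = policy_model_answer(out_path, question)
        # Correctness over the rendering is the symbolic-fidelity signal.
        return float(prediction.strip().lower() == gold_answer.strip().lower())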
☆ PercHead: Perceptual Head Model for Single-Image 3D Head Reconstruction & Editing
We present PercHead, a method for single-image 3D head reconstruction and
semantic 3D editing - two tasks that are inherently challenging due to severe
view occlusions, weak perceptual supervision, and the ambiguity of editing in
3D space. We develop a unified base model for reconstructing view-consistent 3D
heads from a single input image. The model employs a dual-branch encoder
followed by a ViT-based decoder that lifts 2D features into 3D space through
iterative cross-attention. Rendering is performed using Gaussian Splatting. At
the heart of our approach is a novel perceptual supervision strategy based on
DINOv2 and SAM2.1, which provides rich, generalized signals for both geometric
and appearance fidelity. Our model achieves state-of-the-art performance in
novel-view synthesis and exhibits exceptional robustness to extreme viewing
angles compared to established baselines. Moreover, this base model can be
seamlessly extended for semantic 3D editing by swapping the
encoder and finetuning the network. In this variant, we disentangle geometry
and style through two distinct input modalities: a segmentation map to control
geometry and either a text prompt or a reference image to specify appearance.
We highlight the intuitive and powerful 3D editing capabilities of our model
through a lightweight, interactive GUI, where users can effortlessly sculpt
geometry by drawing segmentation maps and stylize appearance via natural
language or image prompts.
comment: Project Page: https://antoniooroz.github.io/PercHead/ Video:
https://www.youtube.com/watch?v=4hFybgTk4kE
☆ Dynamic Reflections: Probing Video Representations with Text Alignment
The alignment of representations from different modalities has recently been
shown to provide insights on the structural similarities and downstream
capabilities of different encoders across diverse data types. While significant
progress has been made in aligning images with text, the temporal nature of
video data remains largely unexplored in this context. In this work, we conduct
the first comprehensive study of video-text representation alignment, probing
the capabilities of modern video and language encoders. Our findings reveal
several key insights. First, we demonstrate that cross-modal alignment highly
depends on the richness of both visual (static images vs. multi-frame videos)
and text (single caption vs. a collection) data provided at test time,
especially when using state-of-the-art video encoders. We propose parametric
test-time scaling laws that capture this behavior and show remarkable
predictive power against empirical observations. Second, we investigate the
correlation between semantic alignment and performance on both semantic and
non-semantic downstream tasks, providing initial evidence that strong alignment
against text encoders may be linked to general-purpose video representation and
understanding. Finally, we correlate temporal reasoning with cross-modal
alignment, providing a challenging test-bed for vision and language models.
Overall, our work introduces video-text alignment as an informative zero-shot
way to probe the representation power of different encoders for spatio-temporal
data. Project page can be found at https://video-prh.github.io/
comment: 21 pages, 12 figures
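As an illustration of what a parametric test-time scaling law can look like,
the sketch below fits a saturating power law of alignment score versus the
number of test-time frames with scipy; the functional form and the numbers are
made-up assumptions, not the paper's fitted law.

    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(n, a, b, c):
        # Saturating power law: alignment approaches `a` as test-time samples grow.
        return a - b * np.power(n, -c)

    # Hypothetical measurements: alignment score vs. number of frames per clip.
    n_frames = np.array([1, 2, 4, 8, 16, 32], dtype=float)
    alignment = np.array([0.31, 0.36, 0.40, 0.43, 0.45, 0.46])

    params, _ = curve_fit(scaling_law, n_frames, alignment, p0=[0.5, 0.2, 0.5])
    a, b, c = params
    print(f"predicted alignment at 64 frames: {scaling_law(64, a, b, c):.3f}")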
☆ LLEXICORP: End-user Explainability of Convolutional Neural Networks
Convolutional neural networks (CNNs) underpin many modern computer vision
systems. With applications ranging from common to critical areas, a need to
explain and understand the model and its decisions (XAI) emerged. Prior works
suggest that in the top layers of CNNs, the individual channels can be
attributed to classifying human-understandable concepts. Concept relevance
propagation (CRP) methods can backtrack predictions to these channels and find
images that most activate these channels. However, current CRP workflows are
largely manual: experts must inspect activation images to name the discovered
concepts and must synthesize verbose explanations from relevance maps, limiting
the accessibility of the explanations and their scalability.
To address these issues, we introduce Large Language model EXplaIns COncept
Relevance Propagation (LLEXICORP), a modular pipeline that couples CRP with a
multimodal large language model. Our approach automatically assigns descriptive
names to concept prototypes and generates natural-language explanations that
translate quantitative relevance distributions into intuitive narratives. To
ensure faithfulness, we craft prompts that teach the language model the
semantics of CRP through examples and enforce a separation between naming and
explanation tasks. The resulting text can be tailored to different audiences,
offering low-level technical descriptions for experts and high-level summaries
for non-technical stakeholders.
We qualitatively evaluate our method on various images from ImageNet on a
VGG16 model. Our findings suggest that integrating concept-based attribution
methods with large language models can significantly lower the barrier to
interpreting deep neural networks, paving the way for more transparent AI
systems.
☆ An unscented Kalman filter method for real-time input-parameter-state estimation
The input-parameter-state estimation capabilities of a novel unscented Kalman
filter are examined herein on both linear and nonlinear systems. The unknown
input is estimated in two stages within each time step. Firstly, the predicted
dynamic states and the system parameters provide an estimate of the input.
Secondly, the states and parameters corrected with the measurements provide a
final estimate. Importantly, it is demonstrated using perturbation analysis
that a system with at least one known input, whether zero or non-zero, can
potentially be uniquely identified. This output-only methodology allows for a better
understanding of the system compared to classical output-only parameter
identification strategies, given that all the dynamic states, the parameters,
and the input are estimated jointly and in real-time.
comment: author-accepted manuscript (AAM) published in Mechanical Systems and
Signal Processing
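To make the two-stage input estimation concrete, one common weighted
least-squares form is sketched below for a linear measurement equation with
direct feedthrough; the notation and the estimator are illustrative
assumptions, and the paper's unscented formulation handles nonlinear systems
and joint parameter estimation.

    y_k = C x_k + D u_k + v_k, \qquad v_k \sim \mathcal{N}(0, R)

    \text{Stage 1 (from the predicted state):}\quad
    \hat{u}_k^{(1)} = (D^\top R^{-1} D)^{-1} D^\top R^{-1} \bigl( y_k - C \hat{x}_{k|k-1} \bigr)

    \text{Stage 2 (after the measurement update):}\quad
    \hat{u}_k^{(2)} = (D^\top R^{-1} D)^{-1} D^\top R^{-1} \bigl( y_k - C \hat{x}_{k|k} \bigr)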
☆ VidEmo: Affective-Tree Reasoning for Emotion-Centric Video Foundation Models
Understanding and predicting emotion from videos has gathered significant
attention in recent studies, driven by advancements in video large language
models (VideoLLMs). While advanced methods have made progress in video emotion
analysis, the intrinsic nature of emotions poses significant challenges.
Emotions are characterized by dynamic and cue-dependent properties, making it
difficult to understand complex and evolving emotional states with reasonable
rationale. To tackle these challenges, we propose a novel affective cues-guided
reasoning framework that unifies fundamental attribute perception, expression
analysis, and high-level emotional understanding in a stage-wise manner. At the
core of our approach is a family of video emotion foundation models (VidEmo),
specifically designed for emotion reasoning and instruction-following. These
models undergo a two-stage tuning process: first, curriculum emotion learning
for injecting emotion knowledge, followed by affective-tree reinforcement
learning for emotion reasoning. Moreover, we establish a foundational data
infrastructure and introduce an emotion-centric fine-grained dataset (Emo-CFG)
consisting of 2.1M diverse instruction-based samples. Emo-CFG includes
explainable emotional question-answering, fine-grained captions, and associated
rationales, providing essential resources for advancing emotion understanding
tasks. Experimental results demonstrate that our approach achieves competitive
performance, setting a new milestone across 15 face perception tasks.
comment: 41 pages, 26 figures
☆ Modality-Transition Representation Learning for Visible-Infrared Person Re-Identification
Visible-infrared person re-identification (VI-ReID) associates pedestrian
images across visible and infrared modalities in practical scenarios with
changing background illumination. However, a substantial gap inherently exists
between these two modalities. Moreover, existing methods primarily rely on
intermediate representations to align cross-modal features of the same person.
These intermediate representations are usually created by generating
intermediate images (a form of data augmentation) or by fusing intermediate
features (adding parameters and reducing interpretability), and neither makes
good use of the intermediate features. Thus, we propose a novel VI-ReID
framework via Modality-Transition Representation Learning (MTRL), which uses a
generated intermediate image as a transmitter from the visible to the infrared
modality; it is fully aligned with the original visible images and similar to
the infrared modality. We then train with a modality-transition contrastive
loss and a modality-query regularization loss, which align the cross-modal
features more effectively. Notably, our proposed framework needs no additional
parameters, achieving the same inference speed as the backbone while improving
its performance on the VI-ReID task. Extensive
experimental results illustrate that our model significantly and consistently
outperforms existing SOTAs on three typical VI-ReID datasets.
☆ Differentiable Hierarchical Visual Tokenization NeurIPS 2025
Vision Transformers rely on fixed patch tokens that ignore the spatial and
semantic structure of images. In this work, we introduce an end-to-end
differentiable tokenizer that adapts to image content with pixel-level
granularity while remaining backward-compatible with existing architectures for
retrofitting pretrained models. Our method uses hierarchical model selection
with information criteria to provide competitive performance in both
image-level classification and dense-prediction tasks, and even supports
out-of-the-box raster-to-vector conversion.
comment: NeurIPS 2025 Spotlight
☆ Can Visual Input Be Compressed? A Visual Token Compression Benchmark for Large Multimodal Models
Tianfan Peng, Yuntao Du, Pengzhou Ji, Shijie Dong, Kailin Jiang, Mingchuan Ma, Yijun Tian, Jinhe Bi, Qian Li, Wei Du, Feng Xiao, Lizhen Cui
Large multimodal models (LMMs) often suffer from severe inference
inefficiency due to the large number of visual tokens introduced by image
encoders. While recent token compression methods, such as pruning and merging,
have shown promise in reducing redundancy, their evaluation remains fragmented
and inconsistent. In this work, we present UniPruneBench, a unified and
extensible benchmark for visual token pruning in multimodal LLMs. UniPruneBench
provides standardized protocols across six ability dimensions and ten datasets,
covering ten representative compression algorithms and three families of LMMs
(LLaVA-v1.5, Intern-VL3, and Qwen2.5-VL). Beyond task accuracy, it incorporates
system-level metrics such as runtime and prefilling latency to provide a
holistic view. Our experiments uncover several key findings: (1) random pruning
is a surprisingly strong baseline, (2) no single method consistently
outperforms others across scenarios, (3) pruning sensitivity varies
significantly across tasks, with OCR being most vulnerable, and (4) pruning
ratio is the dominant factor governing performance degradation. We believe
UniPruneBench will serve as a reliable foundation for future research on
efficient multimodal modeling.
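Since random pruning turns out to be a surprisingly strong baseline, the
sketch below shows that baseline in its simplest form: keep a random subset of
the visual tokens at a fixed ratio before they enter the language model. The
tensor layout, ratio, and function name are assumptions for illustration.

    import torch

    def random_prune_visual_tokens(tokens, keep_ratio=0.25, generator=None):
        """tokens: (batch, num_visual_tokens, dim) output of the image encoder."""
        b, n, d = tokens.shape
        n_keep = max(1, int(n * keep_ratio))
        # Draw an independent random subset of token indices for each sample.
        scores = torch.rand(b, n, generator=generator, device=tokens.device)
        keep_idx = scores.topk(n_keep, dim=1).indices.sort(dim=1).values
        return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

    # Example: prune 576 LLaVA-style visual tokens down to 25%.
    pruned = random_prune_visual_tokens(torch.randn(2, 576, 1024))
    print(pruned.shape)  # torch.Size([2, 144, 1024])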
☆ Robust Face Liveness Detection for Biometric Authentication using Single Image
Biometric technologies are widely adopted in security, legal, and financial
systems. Face recognition can authenticate a person based on the unique facial
features such as shape and texture. However, recent works have demonstrated the
vulnerability of Face Recognition Systems (FRS) towards presentation attacks.
Using spoofing (a.k.a. presentation attacks), a malicious actor can gain
illegitimate access to secure systems. This paper proposes a novel lightweight
CNN framework to identify print/display, video and wrap attacks. The proposed
robust architecture provides seamless liveness detection ensuring faster
biometric authentication (1-2 seconds on CPU). Further, this also presents a
newly created 2D spoof attack dataset consisting of more than 500 videos
collected from 60 subjects. To validate the effectiveness of this architecture,
we provide a demonstration video depicting print/display, video and wrap attack
detection approaches. The demo can be viewed in the following link:
https://rak.box.com/s/m1uf31fn5amtjp4mkgf1huh4ykfeibaa
☆ UniChange: Unifying Change Detection with Multimodal Large Language Model
Change detection (CD) is a fundamental task for monitoring and analyzing land
cover dynamics. While recent high-performance models and high-quality datasets
have significantly advanced the field, a critical limitation persists. Current
models typically acquire limited knowledge from single-type annotated data and
cannot concurrently leverage diverse binary change detection (BCD) and semantic
change detection (SCD) datasets. This constraint leads to poor generalization
and limited versatility. The recent advancements in Multimodal Large Language
Models (MLLMs) introduce new possibilities for a unified CD framework. We
leverage the language priors and unification capabilities of MLLMs to develop
UniChange, the first MLLM-based unified change detection model. UniChange
integrates generative language abilities with specialized CD functionalities.
Our model successfully unifies both BCD and SCD tasks through the introduction
of three special tokens: [T1], [T2], and [CHANGE]. Furthermore, UniChange
utilizes text prompts to guide the identification of change categories,
eliminating the reliance on predefined classification heads. This design allows
UniChange to effectively acquire knowledge from multi-source datasets, even
when their class definitions conflict. Experiments on four public benchmarks
(WHU-CD, S2Looking, LEVIR-CD+, and SECOND) demonstrate SOTA performance,
achieving IoU scores of 90.41, 53.04, 78.87, and 57.62, respectively,
surpassing all previous methods. The code is available at
https://github.com/Erxucomeon/UniChange.
☆ Zero-Shot Multi-Animal Tracking in the Wild
Multi-animal tracking is crucial for understanding animal ecology and
behavior. However, it remains a challenging task due to variations in habitat,
motion patterns, and species appearance. Traditional approaches typically
require extensive model fine-tuning and heuristic design for each application
scenario. In this work, we explore the potential of recent vision foundation
models for zero-shot multi-animal tracking. By combining a Grounding DINO
object detector with the Segment Anything Model 2 (SAM 2) tracker and carefully
designed heuristics, we develop a tracking framework that can be applied to new
datasets without any retraining or hyperparameter adaptation. Evaluations on
ChimpAct, Bird Flock Tracking, AnimalTrack, and a subset of GMOT-40 demonstrate
strong and consistent performance across diverse species and environments. The
code is available at https://github.com/ecker-lab/SAM2-Animal-Tracking.
☆ TAUE: Training-free Noise Transplant and Cultivation Diffusion Model
Despite the remarkable success of text-to-image diffusion models, their
output of a single, flattened image remains a critical bottleneck for
professional applications requiring layer-wise control. Existing solutions
either rely on fine-tuning with large, inaccessible datasets or are
training-free yet limited to generating isolated foreground elements, failing
to produce a complete and coherent scene. To address this, we introduce the
Training-free Noise Transplantation and Cultivation Diffusion Model (TAUE), a
novel framework for zero-shot, layer-wise image generation. Our core technique,
Noise Transplantation and Cultivation (NTC), extracts intermediate latent
representations from both foreground and composite generation processes,
transplanting them into the initial noise for subsequent layers. This ensures
semantic and structural coherence across foreground, background, and composite
layers, enabling consistent, multi-layered outputs without requiring
fine-tuning or auxiliary datasets. Extensive experiments show that our
training-free method achieves performance comparable to fine-tuned methods,
enhancing layer-wise consistency while maintaining high image quality and
fidelity. TAUE not only eliminates costly training and dataset requirements but
also unlocks novel downstream applications, such as complex compositional
editing, paving the way for more accessible and controllable generative
workflows.
comment: 13 pages, 8 figures, 3 tables. The first two authors contributed
equally. Project Page: https://iyatomilab.github.io/TAUE
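The noise transplantation idea, reusing intermediate latents from the
foreground and composite passes as the starting noise of a later layer, can be
pictured as below. The helper denoise_to_step, the blending weight alpha, and
the latent shape are hypothetical placeholders standing in for an actual
diffusion sampler, not the authors' implementation.

    import torch

    def denoise_to_step(latent, prompt, t_stop):
        """Hypothetical: run a diffusion sampler from pure noise down to
        timestep `t_stop` and return the intermediate latent."""
        raise NotImplementedError

    def transplant_noise(fg_prompt, comp_prompt, fg_mask, t_stop,
                         alpha=0.6, shape=(1, 4, 64, 64)):
        # Partially denoise the foreground and composite passes to the same step.
        fg_latent = denoise_to_step(torch.randn(shape), fg_prompt, t_stop)
        comp_latent = denoise_to_step(torch.randn(shape), comp_prompt, t_stop)
        # Transplant: seed the next layer's generation with a masked blend of the
        # two intermediate latents so foreground, background, and composite stay
        # structurally coherent without any fine-tuning.
        seed = fg_mask * fg_latent + (1 - fg_mask) * (
            alpha * comp_latent + (1 - alpha) * torch.randn(shape))
        return seed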
☆ Resource-efficient Automatic Refinement of Segmentations via Weak Supervision from Light Feedback
Delineating anatomical regions is a key task in medical image analysis.
Manual segmentation achieves high accuracy but is labor-intensive and prone to
variability, thus prompting the development of automated approaches. Recently,
a breadth of foundation models has enabled automated segmentations across
diverse anatomies and imaging modalities, but these may not always meet the
clinical accuracy standards. While segmentation refinement strategies can
improve performance, current methods depend on heavy user interactions or
require fully supervised segmentations for training. Here, we present SCORE
(Segmentation COrrection from Regional Evaluations), a weakly supervised
framework that learns to refine mask predictions only using light feedback
during training. Specifically, instead of relying on dense training image
annotations, SCORE introduces a novel loss that leverages region-wise quality
scores and over/under-segmentation error labels. We demonstrate SCORE on
humerus CT scans, where it considerably improves initial predictions from
TotalSegmentator, and achieves performance on par with existing refinement
methods, while greatly reducing their supervision requirements and annotation
time. Our code is available at: https://gitlab.inria.fr/adelangl/SCORE.
☆ A Cognitive Process-Inspired Architecture for Subject-Agnostic Brain Visual Decoding
Subject-agnostic brain decoding, which aims to reconstruct continuous visual
experiences from fMRI without subject-specific training, holds great potential
for clinical applications. However, this direction remains underexplored due to
challenges in cross-subject generalization and the complex nature of brain
signals. In this work, we propose Visual Cortex Flow Architecture (VCFlow), a
novel hierarchical decoding framework that explicitly models the ventral-dorsal
architecture of the human visual system to learn multi-dimensional
representations. By disentangling and leveraging features from early visual
cortex, ventral, and dorsal streams, VCFlow captures diverse and complementary
cognitive information essential for visual reconstruction. Furthermore, we
introduce a feature-level contrastive learning strategy to enhance the
extraction of subject-invariant semantic representations, thereby enhancing
subject-agnostic applicability to previously unseen subjects. Unlike
conventional pipelines that need more than 12 hours of per-subject data and
heavy computation, VCFlow sacrifices only 7% accuracy on average yet generates
each reconstructed video in 10 seconds without any retraining, offering a fast
and clinically scalable solution. The source code will be released upon
acceptance of the paper.
comment: 9 pages main text with 6 figures (excluding references),
supplementary material included
☆ Seeing Across Time and Views: Multi-Temporal Cross-View Learning for Robust Video Person Re-Identification
Video-based person re-identification (ReID) in cross-view domains (for
example, aerial-ground surveillance) remains an open problem because of extreme
viewpoint shifts, scale disparities, and temporal inconsistencies. To address
these challenges, we propose MTF-CVReID, a parameter-efficient framework that
introduces seven complementary modules over a ViT-B/16 backbone. Specifically,
we include: (1) Cross-Stream Feature Normalization (CSFN) to correct camera and
view biases; (2) Multi-Resolution Feature Harmonization (MRFH) for scale
stabilization across altitudes; (3) Identity-Aware Memory Module (IAMM) to
reinforce persistent identity traits; (4) Temporal Dynamics Modeling (TDM) for
motion-aware short-term temporal encoding; (5) Inter-View Feature Alignment
(IVFA) for perspective-invariant representation alignment; (6) Hierarchical
Temporal Pattern Learning (HTPL) to capture multi-scale temporal regularities;
and (7) Multi-View Identity Consistency Learning (MVICL) that enforces
cross-view identity coherence using a contrastive learning paradigm. Despite
adding only about 2 million parameters and 0.7 GFLOPs over the baseline,
MTF-CVReID maintains real-time efficiency (189 FPS) and achieves
state-of-the-art performance on the AG-VPReID benchmark across all altitude
levels, with strong cross-dataset generalization to G2A-VReID and MARS
datasets. These results show that carefully designed adapter-based modules can
substantially enhance cross-view robustness and temporal consistency without
compromising computational efficiency. The source code is available at
https://github.com/MdRashidunnabi/MTF-CVReID
☆ The Urban Vision Hackathon Dataset and Models: Towards Image Annotations and Accurate Vision Models for Indian Traffic
Akash Sharma, Chinmay Mhatre, Sankalp Gawali, Ruthvik Bokkasam, Brij Kishore, Vishwajeet Pattanaik, Tarun Rambha, Abdul R. Pinjari, Vijay Kovvali, Anirban Chakraborty, Punit Rathore, Raghu Krishnapuram, Yogesh Simmhan
This report describes the UVH-26 dataset, the first public release by
AIM@IISc of a large-scale dataset of annotated traffic-camera images from
India. The dataset comprises 26,646 high-resolution (1080p) images sampled from
2,800 of Bengaluru's Safe-City CCTV cameras over a four-week period, and subsequently
annotated through a crowdsourced hackathon involving 565 college students from
across India. In total, 1.8 million bounding boxes were labeled across 14
vehicle classes specific to India: Cycle, 2-Wheeler (Motorcycle), 3-Wheeler
(Auto-rickshaw), LCV (Light Commercial Vehicles), Van, Tempo-traveller,
Hatchback, Sedan, SUV, MUV, Mini-bus, Bus, Truck and Other. Of these, 283k-316k
consensus ground truth bounding boxes and labels were derived for distinct
objects in the 26k images using Majority Voting and STAPLE algorithms. Further,
we train multiple contemporary detectors, including YOLO11-S/X, RT-DETR-S/X,
and DAMO-YOLO-T/L using these datasets, and report accuracy based on mAP50,
mAP75 and mAP50:95. Models trained on UVH-26 achieve 8.4-31.5% improvements in
mAP50:95 over equivalent baseline models trained on COCO dataset, with
RT-DETR-X showing the best performance at 0.67 (mAP50:95) as compared to 0.40
for COCO-trained weights for common classes (Car, Bus, and Truck). This
demonstrates the benefits of domain-specific training data for Indian traffic
scenarios. The release package provides the 26k images with consensus
annotations based on Majority Voting (UVH-26-MV) and STAPLE (UVH-26-ST) and the
6 fine-tuned YOLO and DETR models on each of these datasets. By capturing the
heterogeneity of Indian urban mobility directly from operational traffic-camera
streams, UVH-26 addresses a critical gap in existing global benchmarks, and
offers a foundation for advancing detection, classification, and deployment of
intelligent transportation systems in emerging nations with complex traffic
conditions.
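A minimal sketch of majority-vote consensus over crowdsourced boxes:
annotations are greedily clustered by IoU, and clusters supported by a majority
of annotators produce a median consensus box with the most frequent class
label. The IoU threshold and the greedy clustering rule are illustrative
assumptions, not the exact UVH-26 or STAPLE procedure.

    import numpy as np

    def iou(a, b):
        # Boxes given as (x1, y1, x2, y2).
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-9)

    def majority_vote_boxes(annotations, num_annotators, iou_thr=0.5):
        """annotations: list of (annotator_id, box, label) for one image."""
        clusters = []
        for ann_id, box, label in annotations:
            for c in clusters:
                if iou(box, c["boxes"][0]) >= iou_thr:
                    c["boxes"].append(box); c["labels"].append(label)
                    c["annotators"].add(ann_id)
                    break
            else:
                clusters.append({"boxes": [box], "labels": [label],
                                 "annotators": {ann_id}})
        consensus = []
        for c in clusters:
            if len(c["annotators"]) > num_annotators / 2:   # majority agreement
                box = np.median(np.array(c["boxes"]), axis=0)
                label = max(set(c["labels"]), key=c["labels"].count)
                consensus.append((box, label))
        return consensus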
☆ SigmaCollab: An Application-Driven Dataset for Physically Situated Collaboration
We introduce SigmaCollab, a dataset enabling research on physically situated
human-AI collaboration. The dataset consists of a set of 85 sessions in which
untrained participants were guided by a mixed-reality assistive AI agent in
performing procedural tasks in the physical world. SigmaCollab includes a set
of rich, multimodal data streams, such as the participant and system audio,
egocentric camera views from the head-mounted device, depth maps, head, hand
and gaze tracking information, as well as additional annotations performed
post-hoc. While the dataset is relatively small in size (~ 14 hours), its
application-driven and interactive nature brings to the fore novel research
challenges for human-AI collaboration, and provides more realistic testing
grounds for various AI models operating in this space. In future work, we plan
to use the dataset to construct a set of benchmarks for physically situated
collaboration in mixed-reality task assistive scenarios. SigmaCollab is
available at https://github.com/microsoft/SigmaCollab.
☆ Forecasting Future Anatomies: Longitudinal Brain MRI-to-MRI Prediction
Predicting future brain state from a baseline magnetic resonance image (MRI)
is a central challenge in neuroimaging and has important implications for
studying neurodegenerative diseases such as Alzheimer's disease (AD). Most
existing approaches predict future cognitive scores or clinical outcomes, such
as conversion from mild cognitive impairment to dementia. Instead, here we
investigate longitudinal MRI image-to-image prediction that forecasts a
participant's entire brain MRI several years into the future, intrinsically
modeling complex, spatially distributed neurodegenerative patterns. We
implement and evaluate five deep learning architectures (UNet, U2-Net, UNETR,
Time-Embedding UNet, and ODE-UNet) on two longitudinal cohorts (ADNI and AIBL).
Predicted follow-up MRIs are directly compared with the actual follow-up scans
using metrics that capture global similarity and local differences. The best
performing models achieve high-fidelity predictions, and all models generalize
well to an independent external dataset, demonstrating robust cross-cohort
performance. Our results indicate that deep learning can reliably predict
participant-specific brain MRI at the voxel level, offering new opportunities
for individualized prognosis.
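The comparison between predicted and actual follow-up scans can be illustrated
with simple voxel-level metrics, for example mean squared error and PSNR for
local differences and normalized cross-correlation as a global-similarity
proxy; this particular metric set is an assumption for illustration and may
differ from the paper's.

    import numpy as np

    def mse(pred, target):
        return float(np.mean((pred - target) ** 2))

    def psnr(pred, target, data_range=None):
        data_range = data_range or (target.max() - target.min())
        return float(10.0 * np.log10(data_range ** 2 / (mse(pred, target) + 1e-12)))

    def normalized_cross_correlation(pred, target):
        p = (pred - pred.mean()) / (pred.std() + 1e-12)
        t = (target - target.mean()) / (target.std() + 1e-12)
        return float(np.mean(p * t))

    # Example on random 3D volumes standing in for predicted/actual follow-up MRIs.
    pred, target = np.random.rand(96, 96, 96), np.random.rand(96, 96, 96)
    print(psnr(pred, target), normalized_cross_correlation(pred, target))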
☆ Unsupervised Learning for Industrial Defect Detection: A Case Study on Shearographic Data
Shearography is a non-destructive testing method for detecting subsurface
defects, offering high sensitivity and full-field inspection capabilities.
However, its industrial adoption remains limited due to the need for expert
interpretation. To reduce reliance on labeled data and manual evaluation, this
study explores unsupervised learning methods for automated anomaly detection in
shearographic images. Three architectures are evaluated: a fully connected
autoencoder, a convolutional autoencoder, and a student-teacher feature
matching model. All models are trained solely on defect-free data. A controlled
dataset was developed using a custom specimen with reproducible defect
patterns, enabling systematic acquisition of shearographic measurements under
both ideal and realistic deformation conditions. Two training subsets were
defined: one containing only undistorted, defect-free samples, and one
additionally including globally deformed, yet defect-free, data. The latter
simulates practical inspection conditions by incorporating deformation-induced
fringe patterns that may obscure localized anomalies. The models are evaluated
in terms of binary classification and, for the student-teacher model, spatial
defect localization. Results show that the student-teacher approach achieves
superior classification robustness and enables precise localization. Compared
to the autoencoder-based models, it demonstrates improved separability of
feature representations, as visualized through t-SNE embeddings. Additionally,
a YOLOv8 model trained on labeled defect data serves as a reference to
benchmark localization quality. This study underscores the potential of
unsupervised deep learning for scalable, label-efficient shearographic
inspection in industrial environments.
comment: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI
International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18
DECEMBER 2025
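For readers unfamiliar with student-teacher feature matching, the sketch below
shows the generic recipe: a frozen pretrained teacher and a trainable student
produce feature maps, the student is trained on defect-free images to match
the teacher, and the per-pixel feature discrepancy at test time serves as the
anomaly map. The ResNet-18 backbone, layer choice, and cosine matching loss
are assumptions, not necessarily the configuration used in the study.

    import torch
    import torch.nn.functional as F
    import torchvision

    teacher = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
    student = torchvision.models.resnet18(weights=None)

    def features(model, x):
        # Use an intermediate feature map (after layer2) as the matching target.
        x = model.conv1(x); x = model.bn1(x); x = model.relu(x); x = model.maxpool(x)
        x = model.layer1(x); x = model.layer2(x)
        return x

    def training_loss(x_defect_free):
        with torch.no_grad():
            t = F.normalize(features(teacher, x_defect_free), dim=1)
        s = F.normalize(features(student, x_defect_free), dim=1)
        return (1.0 - (t * s).sum(dim=1)).mean()   # cosine-distance matching

    def anomaly_map(x):
        with torch.no_grad():
            t = F.normalize(features(teacher, x), dim=1)
            s = F.normalize(features(student, x), dim=1)
            d = 1.0 - (t * s).sum(dim=1, keepdim=True)   # (B, 1, h, w) discrepancy
            return F.interpolate(d, size=x.shape[-2:], mode="bilinear",
                                 align_corners=False)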
☆ LiteVoxel: Low-memory Intelligent Thresholding for Efficient Voxel Rasterization
Sparse-voxel rasterization is a fast, differentiable alternative for
optimization-based scene reconstruction, but it tends to underfit low-frequency
content, depends on brittle pruning heuristics, and can overgrow in ways that
inflate VRAM. We introduce LiteVoxel, a self-tuning training pipeline that
makes SV rasterization both steadier and lighter. Our loss is made
low-frequency aware via an inverse-Sobel reweighting with a mid-training
gamma-ramp, shifting gradient budget to flat regions only after the geometry
stabilizes. Adaptation replaces fixed thresholds with a depth-quantile pruning
logic on the maximum blending weight, stabilized by EMA-hysteresis guards, and
refines structure through ray-footprint-based, priority-driven subdivision
under an explicit growth budget. Ablations and full-system results across the
Mip-NeRF 360 (6 scenes) and Tanks & Temples (3 scenes) datasets show mitigation
of errors in low-frequency regions and boundary instability while keeping
PSNR/SSIM, training time, and FPS comparable to a strong SVRaster pipeline.
Crucially, LiteVoxel reduces peak VRAM by ~40%-60% and preserves low-frequency
detail that prior setups miss, enabling more predictable, memory-efficient
training without sacrificing perceptual quality.
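One way to picture quantile-based pruning with EMA hysteresis is sketched
below: an exponential moving average of each voxel's maximum blending weight
is maintained, the pruning threshold is a quantile of that statistic over live
voxels, and a hysteresis factor keeps borderline voxels from being toggled
every step. The decay, quantile, and hysteresis values are illustrative
assumptions, not the paper's settings.

    import numpy as np

    class QuantilePruner:
        def __init__(self, num_voxels, decay=0.9, q=0.05, hysteresis=0.5):
            self.ema = np.zeros(num_voxels)   # EMA of per-voxel max blending weight
            self.decay, self.q, self.hysteresis = decay, q, hysteresis
            self.alive = np.ones(num_voxels, dtype=bool)

        def step(self, max_blend_weight):
            # Update the running statistic from the current iteration's weights.
            self.ema = self.decay * self.ema + (1 - self.decay) * max_blend_weight
            thr = np.quantile(self.ema[self.alive], self.q)
            # Hysteresis: only voxels whose EMA falls well below the quantile
            # threshold are pruned, so borderline voxels are not toggled every step.
            self.alive &= ~(self.ema < thr * self.hysteresis)
            return self.alive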
☆ Keeping it Local, Tiny and Real: Automated Report Generation on Edge Computing Devices for Mechatronic-Based Cognitive Systems
Recent advancements in Deep Learning enable hardware-based cognitive systems,
that is, mechatronic systems in general and robotics in particular with
integrated Artificial Intelligence, to interact with dynamic and unstructured
environments. While the results are impressive, the application of such systems
to critical tasks like autonomous driving as well as service and care robotics
necessitate the evaluation of large amount of heterogeneous data. Automated
report generation for Mobile Robotics can play a crucial role in facilitating
the evaluation and acceptance of such systems in various domains. In this
paper, we propose a pipeline for generating automated reports in natural
language utilizing various multi-modal sensors that solely relies on local
models capable of being deployed on edge computing devices, thus preserving the
privacy of all actors involved and eliminating the need for external services.
In particular, we evaluate our implementation on a diverse dataset spanning
multiple domains including indoor, outdoor and urban environments, providing
quantitative as well as qualitative evaluation results. Various generated
example reports and other supplementary materials are available via a public
repository.
comment: 6 pages, 4 figures, 1 table; accepted for MECATRONICS-REM 2025
International Conference, PARIS, FRANCE December 3-5 2025
☆ ESA: Energy-Based Shot Assembly Optimization for Automatic Video Editing
Shot assembly is a crucial step in film production and video editing,
involving the sequencing and arrangement of shots to construct a narrative,
convey information, or evoke emotions. Traditionally, this process has been
manually executed by experienced editors. While current intelligent video
editing technologies can handle some automated video editing tasks, they often
fail to capture the creator's unique artistic expression in shot assembly. To
address this challenge, we propose an energy-based optimization method for
video shot assembly. Specifically, we first perform visual-semantic matching
between the script generated by a large language model and a video library to
obtain subsets of candidate shots aligned with the script semantics. Next, we
segment and label the shots from reference videos, extracting attributes such
as shot size, camera motion, and semantics. We then employ energy-based models
to learn from these attributes, scoring candidate shot sequences based on their
alignment with reference styles. Finally, we achieve shot assembly optimization
by combining multiple syntax rules, producing videos that align with the
assembly style of the reference videos. Our method not only automates the
arrangement and combination of independent shots according to specific logic,
narrative requirements, or artistic styles but also learns the assembly style
of reference videos, creating a coherent visual sequence or holistic visual
expression. With our system, even users with no prior video editing experience
can create visually compelling videos. Project page:
https://sobeymil.github.io/esa.com
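The energy-based selection step can be pictured as scoring candidate shot
sequences by summed transition energies over shot attributes and keeping the
lowest-energy sequence. The toy frequency-based energy, the attribute tuples,
and the brute-force search below are illustrative assumptions standing in for
the learned energy model.

    from itertools import permutations
    from collections import Counter
    import math

    # Shots described by discrete attributes, e.g. (shot_size, camera_motion).
    reference = [("wide", "static"), ("medium", "pan"), ("close", "static"),
                 ("medium", "pan"), ("close", "static")]

    # Toy energy model: negative log-frequency of attribute transitions observed
    # in the reference video (a stand-in for a trained energy network).
    transitions = Counter(zip(reference, reference[1:]))
    total = sum(transitions.values())

    def transition_energy(prev_shot, next_shot):
        p = (transitions[(prev_shot, next_shot)] + 1e-3) / (total + 1e-3)
        return -math.log(p)

    def sequence_energy(seq):
        return sum(transition_energy(a, b) for a, b in zip(seq, seq[1:]))

    def best_assembly(candidates, length):
        # Exhaustive search for illustration; the real system combines learned
        # energies with syntax rules and script alignment instead of brute force.
        return min(permutations(candidates, length), key=sequence_energy)

    candidates = [("wide", "static"), ("close", "static"), ("medium", "pan")]
    print(best_assembly(candidates, 3))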
☆ Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes
The automation of workflows in advanced microscopy is a key goal where
foundation models like Language Models (LLMs) and Vision-Language Models (VLMs)
show great potential. However, adapting these general-purpose models for
specialized scientific tasks is critical, and the optimal domain adaptation
strategy is often unclear. To address this, we introduce PtychoBench, a new
multi-modal, multi-task benchmark for ptychographic analysis. Using this
benchmark, we systematically compare two specialization strategies: Supervised
Fine-Tuning (SFT) and In-Context Learning (ICL). We evaluate these strategies
on a visual artifact detection task with VLMs and a textual parameter
recommendation task with LLMs in a data-scarce regime. Our findings reveal that
the optimal specialization pathway is task-dependent. For the visual task, SFT
and ICL are highly complementary, with a fine-tuned model guided by
context-aware examples achieving the highest mean performance (Micro-F1 of
0.728). Conversely, for the textual task, ICL on a large base model is the
superior strategy, reaching a peak Micro-F1 of 0.847 and outperforming a
powerful "super-expert" SFT model (0-shot Micro-F1 of 0.839). We also confirm
the superiority of context-aware prompting and identify a consistent contextual
interference phenomenon in fine-tuned models. These results, benchmarked
against strong baselines including GPT-4o and a DINOv3-based classifier, offer
key observations for AI in science: the optimal specialization path in our
benchmark is dependent on the task modality, offering a clear framework for
developing more effective science-based agentic systems.
☆ DetectiumFire: A Comprehensive Multi-modal Dataset Bridging Vision and Language for Fire Understanding NeurIPS 2025
Recent advances in multi-modal models have demonstrated strong performance in
tasks such as image generation and reasoning. However, applying these models to
the fire domain remains challenging due to the lack of publicly available
datasets with high-quality fire domain annotations. To address this gap, we
introduce DetectiumFire, a large-scale, multi-modal dataset comprising 22.5k
high-resolution fire-related images and 2.5k real-world fire-related videos
covering a wide range of fire types, environments, and risk levels. The data
are annotated with both traditional computer vision labels (e.g., bounding
boxes) and detailed textual prompts describing the scene, enabling applications
such as synthetic data generation and fire risk reasoning. DetectiumFire offers
clear advantages over existing benchmarks in scale, diversity, and data
quality, significantly reducing redundancy and enhancing coverage of real-world
scenarios. We validate the utility of DetectiumFire across multiple tasks,
including object detection, diffusion-based image generation, and
vision-language reasoning. Our results highlight the potential of this dataset
to advance fire-related research and support the development of intelligent
safety systems. We release DetectiumFire to promote broader exploration of fire
understanding in the AI community. The dataset is available at
https://kaggle.com/datasets/38b79c344bdfc55d1eed3d22fbaa9c31fad45e27edbbe9e3c529d6e5c4f93890
comment: Advances in Neural Information Processing Systems 2025 (NeurIPS
2025), Poster, https://neurips.cc/virtual/2025/loc/san-diego/poster/121400
☆ Object Detection as an Optional Basis: A Graph Matching Network for Cross-View UAV Localization
With the rapid growth of the low-altitude economy, UAVs have become crucial
for measurement and tracking in patrol systems. However, in GNSS-denied areas,
satellite-based localization methods are prone to failure. This paper presents
a cross-view UAV localization framework that performs map matching via object
detection, aimed at effectively addressing cross-temporal, cross-view,
heterogeneous aerial image matching. In typical pipelines, UAV visual
localization is formulated as an image-retrieval problem: features are
extracted to build a localization map, and the pose of a query image is
estimated by matching it to a reference database with known poses. Because
publicly available UAV localization datasets are limited, many approaches
recast localization as a classification task and rely on scene labels in these
datasets to ensure accuracy. Other methods seek to reduce cross-domain
differences using polar-coordinate reprojection, perspective transformations,
or generative adversarial networks; however, they can suffer from misalignment,
content loss, and limited realism. In contrast, we leverage modern object
detection to accurately extract salient instances from UAV and satellite
images, and integrate a graph neural network to reason about inter-image and
intra-image node relationships. Using a fine-grained, graph-based
node-similarity metric, our method achieves strong retrieval and localization
performance. Extensive experiments on public and real-world datasets show that
our approach handles heterogeneous appearance differences effectively and
generalizes well, making it applicable to scenarios with larger modality gaps,
such as infrared-visible image matching. Our dataset will be publicly available
at the following URL: https://github.com/liutao23/ODGNNLoc.git.
comment: 20 pages, Submitted to IEEE TIM
☆ OLATverse: A Large-scale Real-world Object Dataset with Precise Lighting Control
Xilong Zhou, Jianchun Chen, Pramod Rao, Timo Teufel, Linjie Lyu, Tigran Minasian, Oleksandr Sotnychenko, Xiaoxiao Long, Marc Habermann, Christian Theobalt
We introduce OLATverse, a large-scale dataset comprising around 9M images of
765 real-world objects, captured from multiple viewpoints under a diverse set
of precisely controlled lighting conditions. While recent advances in
object-centric inverse rendering, novel view synthesis and relighting have
shown promising results, most techniques still heavily rely on the synthetic
datasets for training and small-scale real-world datasets for benchmarking,
which limits their realism and generalization. To address this gap, OLATverse
offers two key advantages over existing datasets: large-scale coverage of real
objects and high-fidelity appearance under precisely controlled illuminations.
Specifically, OLATverse contains 765 common and uncommon real-world objects,
spanning a wide range of material categories. Each object is captured using 35
DSLR cameras and 331 individually controlled light sources, enabling the
simulation of diverse illumination conditions. In addition, for each object, we
provide well-calibrated camera parameters, accurate object masks, photometric
surface normals, and diffuse albedo as auxiliary resources. We also construct
an extensive evaluation set, establishing the first comprehensive real-world
object-centric benchmark for inverse rendering and normal estimation. We
believe that OLATverse represents a pivotal step toward integrating the next
generation of inverse rendering and relighting methods with real-world data.
The full dataset, along with all post-processing workflows, will be publicly
released at https://vcai.mpi-inf.mpg.de/projects/OLATverse/.
☆ MVAFormer: RGB-based Multi-View Spatio-Temporal Action Recognition with Transformer ICIP2024
Multi-view action recognition aims to recognize human actions using multiple
camera views and deals with occlusion caused by obstacles or crowds. In this
task, cooperation among views, which generates a joint representation by
combining multiple views, is vital. Previous studies have explored promising
cooperation methods for improving performance. However, since their methods
focus only on the task setting of recognizing a single action from an entire
video, they are not applicable to the recently popular spatio-temporal action
recognition (STAR) setting, in which each person's action is recognized
sequentially. To address this problem, this paper proposes a multi-view action
recognition method for the STAR setting, called MVAFormer. In MVAFormer, we
introduce a novel transformer-based cooperation module among views. In contrast
to previous studies, which utilize embedding vectors with lost spatial
information, our module utilizes the feature map for effective cooperation in
the STAR setting, which preserves the spatial information. Furthermore, in our
module, we divide the self-attention for the same and different views to model
the relationship between multiple views effectively. The results of experiments
using a newly collected dataset demonstrate that MVAFormer outperforms the
comparison baselines by approximately 4.4 points on the F-measure.
comment: Selected as Best Industry Paper Award at ICIP2024
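The separation of self-attention into same-view and different-view
interactions can be sketched with attention masks derived from per-token view
indices; the module below is a generic illustration (dimensions, head count,
and the residual combination are assumptions), not the authors'
implementation, and it assumes the token-to-view layout is shared across the
batch.

    import torch
    import torch.nn as nn

    class ViewSplitAttention(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.same = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, tokens, view_ids):
            # tokens: (B, N, D) feature-map tokens gathered from all views,
            # view_ids: (B, N) integer view index of every token (same layout
            # for every item in the batch, so one (N, N) mask suffices).
            same_view = view_ids.unsqueeze(2) == view_ids.unsqueeze(1)   # (B, N, N)
            # For bool attn_mask, True means "do not attend".
            same_out, _ = self.same(tokens, tokens, tokens,
                                    attn_mask=~same_view[0])   # within-view only
            cross_out, _ = self.cross(tokens, tokens, tokens,
                                      attn_mask=same_view[0])  # across views only
            return tokens + same_out + cross_out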
☆ HAGI++: Head-Assisted Gaze Imputation and Generation
Mobile eye tracking plays a vital role in capturing human visual attention
across both real-world and extended reality (XR) environments, making it an
essential tool for applications ranging from behavioural research to
human-computer interaction. However, missing values due to blinks, pupil
detection errors, or illumination changes pose significant challenges for
further gaze data analysis. To address this challenge, we introduce HAGI++ - a
multi-modal diffusion-based approach for gaze data imputation that, for the
first time, uses the integrated head orientation sensors to exploit the
inherent correlation between head and eye movements. HAGI++ employs a
transformer-based diffusion model to learn cross-modal dependencies between eye
and head representations and can be readily extended to incorporate additional
body movements. Extensive evaluations on the large-scale Nymeria, Ego-Exo4D,
and HOT3D datasets demonstrate that HAGI++ consistently outperforms
conventional interpolation methods and deep learning-based time-series
imputation baselines in gaze imputation. Furthermore, statistical analyses
confirm that HAGI++ produces gaze velocity distributions that closely match
actual human gaze behaviour, ensuring more realistic gaze imputations.
Moreover, by incorporating wrist motion captured from commercial wearable
devices, HAGI++ surpasses prior methods that rely on full-body motion capture
in the extreme case of 100% missing gaze data (pure gaze generation). Our
method paves the way for more complete and accurate eye gaze recordings in
real-world settings and has significant potential for enhancing gaze-based
analysis and interaction across various application domains.
comment: Extended version of our UIST'25 paper "HAGI: Head-Assisted Gaze
Imputation for Mobile Eye Trackers"
☆ KAO: Kernel-Adaptive Optimization in Diffusion for Satellite Image
Satellite image inpainting is a crucial task in remote sensing, where
accurately restoring missing or occluded regions is essential for robust image
analysis. In this paper, we propose KAO, a novel framework that utilizes
Kernel-Adaptive Optimization within diffusion models for satellite image
inpainting. KAO is specifically designed to address the challenges posed by
very high-resolution (VHR) satellite datasets, such as DeepGlobe and the
Massachusetts Roads Dataset. Unlike existing methods that rely on
preconditioned models requiring extensive retraining or postconditioned models
with significant computational overhead, KAO introduces a Latent Space
Conditioning approach, optimizing a compact latent space to achieve efficient
and accurate inpainting. Furthermore, we incorporate Explicit Propagation into
the diffusion process, facilitating forward-backward fusion, which improves the
stability and precision of the method. Experimental results demonstrate that
KAO sets a new benchmark for VHR satellite image restoration, providing a
scalable, high-performance solution that balances the efficiency of
preconditioned models with the flexibility of postconditioned models.
comment: 18 pages
☆ From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics
Video Understanding, Scene Interpretation and Commonsense Reasoning are
highly challenging tasks enabling the interpretation of visual information,
allowing agents to perceive, interact with and make rational decisions in its
environment. Large Language Models (LLMs) and Visual Language Models (VLMs)
have shown remarkable advancements in these areas in recent years, enabling
domain-specific applications as well as zero-shot open vocabulary tasks,
combining multiple domains. However, the required computational complexity
poses challenges for their application on edge devices and in the context of
Mobile Robotics, especially considering the trade-off between accuracy and
inference time. In this paper, we investigate the capabilities of
state-of-the-art VLMs for the task of Scene Interpretation and Action
Recognition, with special regard to small VLMs capable of being deployed to
edge devices in the context of Mobile Robotics. The proposed pipeline is
evaluated on a diverse dataset consisting of various real-world cityscape,
on-campus and indoor scenarios. The experimental evaluation discusses the
potential of these small models on edge devices, with particular emphasis on
challenges, weaknesses, inherent model biases and the application of the gained
information. Supplementary material is provided via the following repository:
https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
comment: 15 pages, 6 figures, 1 table; accepted for AI-2025 Forty-fifth SGAI
International Conference on Artificial Intelligence CAMBRIDGE, ENGLAND 16-18
DECEMBER 2025
☆ A Kullback-Leibler divergence method for input-system-state identification
The capability of a novel Kullback-Leibler divergence method is examined
herein within the Kalman filter framework to select the input-parameter-state
estimation execution with the most plausible results. This identification
suffers from the uncertainty related to obtaining different results from
different initial parameter set guesses, and the examined approach uses the
information gained from the data in going from the prior to the posterior
distribution to address the issue. Firstly, the Kalman filter is performed for
a number of different initial parameter sets providing the system
input-parameter-state estimation. Secondly, the resulting posterior
distributions are compared simultaneously to the initial prior distributions
using the Kullback-Leibler divergence. Finally, the identification with the
least Kullback-Leibler divergence is selected as the one with the most
plausible results. Importantly, the method is shown to select the
better-performing identification in linear, nonlinear, and limited-information
applications, providing a powerful tool for system monitoring.
comment: 32 pages, 17 figures, published in Journal of Sound and Vibration
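For illustration, a minimal numpy sketch of the selection rule described
above, assuming Gaussian priors and posteriors over the parameters; the helper
names and the run container are illustrative, not from the paper:

    import numpy as np

    def kl_gaussian(mu_q, cov_q, mu_p, cov_p):
        # KL( N(mu_q, cov_q) || N(mu_p, cov_p) ) for multivariate Gaussians.
        k = mu_q.shape[0]
        cov_p_inv = np.linalg.inv(cov_p)
        diff = mu_p - mu_q
        _, logdet_p = np.linalg.slogdet(cov_p)
        _, logdet_q = np.linalg.slogdet(cov_q)
        return 0.5 * (np.trace(cov_p_inv @ cov_q) + diff @ cov_p_inv @ diff
                      - k + logdet_p - logdet_q)

    def select_run(runs):
        # runs: list of (prior_mean, prior_cov, post_mean, post_cov), one per
        # Kalman filter execution started from a different initial guess.
        kls = [kl_gaussian(post_mu, post_cov, pri_mu, pri_cov)
               for (pri_mu, pri_cov, post_mu, post_cov) in runs]
        return int(np.argmin(kls))  # least divergence = most plausible run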
☆ Synthetic Crop-Weed Image Generation and its Impact on Model Generalization
Precise semantic segmentation of crops and weeds is necessary for
agricultural weeding robots. However, training deep learning models requires
large annotated datasets, which are costly to obtain in real fields. Synthetic
data can reduce this burden, but the gap between simulated and real images
remains a challenge. In this paper, we present a pipeline for procedural
generation of synthetic crop-weed images using Blender, producing annotated
datasets under diverse conditions of plant growth, weed density, lighting, and
camera angle. We benchmark several state-of-the-art segmentation models on
synthetic and real datasets and analyze their cross-domain generalization. Our
results show that training on synthetic images leads to a sim-to-real gap of
10%, surpassing previous state-of-the-art methods. Moreover, synthetic data
demonstrates good generalization properties, outperforming real datasets in
cross-domain scenarios. These findings highlight the potential of synthetic
agricultural datasets and support hybrid strategies for more efficient model
training.
☆ ChartM$^3$: A Multi-Stage Code-Driven Pipeline for Constructing Multi-Dimensional and Multi-Step Visual Reasoning Data in Chart Comprehension EMNLP25
Complex chart understanding tasks demand advanced visual recognition and
reasoning capabilities from multimodal large language models (MLLMs). However,
current research provides limited coverage of complex chart scenarios and
computation-intensive reasoning tasks prevalent in real-world applications.
This study proposes an automated multi-stage code-driven pipeline for
systematically generating visual reasoning datasets to address these
limitations. The pipeline integrates retrieval-augmented generation (RAG) to
retrieve professional chart templates and employs chain-of-thought (CoT)
strategies to generate reasoning codes that simulate real data distributions,
thereby driving chart rendering and question-related statistical computations.
Through model-based evaluation, the pipeline enhances chart diversity and data
quality. Using this framework, we construct ChartM$^3$, a multi-dimensional and
multi-step dataset containing 38K charts and 142K Q&A pairs for training, along
with 2,871 high-quality evaluation samples for enabling practical performance
assessment. Supervised fine-tuning (SFT) and reinforcement learning (RL)
experiments demonstrate that our dataset significantly improves reasoning
capabilities and cross-domain generalization performance, enabling smaller
models to achieve performance comparable to larger-scale models in complex
chart comprehension.
comment: 23 pages, EMNLP25 Accepted
☆ IllumFlow: Illumination-Adaptive Low-Light Enhancement via Conditional Rectified Flow and Retinex Decomposition
We present IllumFlow, a novel framework that synergizes conditional Rectified
Flow (CRF) with Retinex theory for low-light image enhancement (LLIE). Our
model addresses low-light enhancement through separate optimization of
illumination and reflectance components, effectively handling both lighting
variations and noise. Specifically, we first decompose an input image into
reflectance and illumination components following Retinex theory. To model the
wide dynamic range of illumination variations in low-light images, we propose a
conditional rectified flow framework that represents illumination changes as a
continuous flow field. While complex noise primarily resides in the reflectance
component, we introduce a denoising network, enhanced by flow-derived data
augmentation, to remove reflectance noise and chromatic aberration while
preserving color fidelity. IllumFlow enables precise illumination adaptation
across lighting conditions while naturally supporting customizable brightness
enhancement. Extensive experiments on low-light enhancement and exposure
correction demonstrate superior quantitative and qualitative performance over
existing methods.
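As a point of reference for the decomposition step, a classical (non-learned)
Retinex-style split of an image into reflectance and illumination, I = R * L,
can be sketched as follows; this is only a crude approximation of the learned
decomposition used by IllumFlow, and the smoothing scale is a placeholder:

    from scipy.ndimage import gaussian_filter

    def retinex_decompose(img, sigma=15, eps=1e-6):
        # img: float RGB array in [0, 1], shape (H, W, 3).
        # Illumination estimate: smoothed maximum over color channels.
        illumination = gaussian_filter(img.max(axis=2), sigma)
        reflectance = img / (illumination[..., None] + eps)
        return reflectance, illumination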
☆ Purrturbed but Stable: Human-Cat Invariant Representations Across CNNs, ViTs and Self-Supervised ViTs
Cats and humans differ in ocular anatomy. Most notably, Felis catus (domestic
cats) have vertically elongated pupils linked to ambush predation; yet, how
such specializations manifest in downstream visual representations remains
incompletely understood. We present a unified, frozen-encoder benchmark that
quantifies feline-human cross-species representational alignment in the wild,
across convolutional networks, supervised Vision Transformers, windowed
transformers, and self-supervised ViTs (DINO), using layer-wise Centered Kernel
Alignment (linear and RBF) and Representational Similarity Analysis, with
additional distributional and stability tests reported in the paper. Across
models, DINO ViT-B/16 attains the most substantial alignment (mean CKA-RBF
$\approx0.814$, mean CKA-linear $\approx0.745$, mean RSA $\approx0.698$),
peaking at early blocks, indicating that token-level self-supervision induces
early-stage features that bridge species-specific statistics. Supervised ViTs
are competitive on CKA yet show weaker geometric correspondence than DINO
(e.g., ViT-B/16 RSA $\approx0.53$ at block8; ViT-L/16 $\approx0.47$ at
block14), revealing depth-dependent divergences between similarity and
representational geometry. CNNs remain strong baselines but below plain ViTs on
alignment, and windowed transformers underperform plain ViTs, implicating
architectural inductive biases in cross-species alignment. Results indicate
that self-supervision coupled with ViT inductive biases yields representational
geometries that more closely align feline and human visual systems than widely
used CNNs and windowed Transformers, providing testable neuroscientific
hypotheses about where and how cross-species visual computations converge. We
release our code and dataset for reference and reproducibility.
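Layer-wise alignment scores such as the CKA values quoted above can be
computed from paired activations in a few lines; the sketch below shows the
standard linear CKA on activations X and Y extracted for the same n stimuli
(it is generic, not code released by the authors):

    import numpy as np

    def linear_cka(X, Y):
        # X: (n, d1), Y: (n, d2) activations for the same n inputs.
        X = X - X.mean(axis=0, keepdims=True)
        Y = Y - Y.mean(axis=0, keepdims=True)
        hsic = np.linalg.norm(Y.T @ X, 'fro') ** 2
        return hsic / (np.linalg.norm(X.T @ X, 'fro')
                       * np.linalg.norm(Y.T @ Y, 'fro'))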
☆ MammoClean: Toward Reproducible and Bias-Aware AI in Mammography through Dataset Harmonization
The development of clinically reliable artificial intelligence (AI) systems
for mammography is hindered by profound heterogeneity in data quality, metadata
standards, and population distributions across public datasets. This
heterogeneity introduces dataset-specific biases that severely compromise the
generalizability of the model, a fundamental barrier to clinical deployment. We
present MammoClean, a public framework for standardization and bias
quantification in mammography datasets. MammoClean standardizes case selection,
image processing (including laterality and intensity correction), and unifies
metadata into a consistent multi-view structure. We provide a comprehensive
review of breast anatomy, imaging characteristics, and public mammography
datasets to systematically identify key sources of bias. Applying MammoClean to
three heterogeneous datasets (CBIS-DDSM, TOMPEI-CMMD, VinDr-Mammo), we quantify
substantial distributional shifts in breast density and abnormality prevalence.
Critically, we demonstrate the direct impact of data corruption: AI models
trained on corrupted datasets exhibit significant performance degradation
compared to their curated counterparts. By using MammoClean to identify and
mitigate bias sources, researchers can construct unified multi-dataset training
corpora that enable development of robust models with superior cross-domain
generalization. MammoClean provides an essential, reproducible pipeline for
bias-aware AI development in mammography, facilitating fairer comparisons and
advancing the creation of safe, effective systems that perform equitably across
diverse patient populations and clinical settings. The open-source code is
publicly available from: https://github.com/Minds-R-Lab/MammoClean.
☆ A Novel Grouping-Based Hybrid Color Correction Algorithm for Color Point Clouds
Color consistency correction for color point clouds is a fundamental yet
important task in 3D rendering and compression applications. Most previous
color correction methods, however, were designed for color images rather than
point clouds. This paper proposes a grouping-based hybrid color
correction algorithm for color point clouds. Our algorithm begins by estimating
the overlapping rate between the aligned source and target point clouds, and
then adaptively partitions the target points into two groups, namely the close
proximity group Gcl and the moderate proximity group Gmod, or three groups,
namely Gcl, Gmod, and the distant proximity group Gdist, when the estimated
overlapping rate is low or high, respectively. To correct color for target
points in Gcl, a K-nearest neighbors based bilateral interpolation (KBI) method
is proposed. To correct color for target points in Gmod, a joint KBI and the
histogram equalization (JKHE) method is proposed. For target points in Gdist, a
histogram equalization (HE) method is proposed for color correction. Finally,
we discuss the grouping-effect-free property of our algorithm and present an
ablation study. The color consistency correction benefit of our algorithm is
demonstrated on 1086 test color point cloud pairs against state-of-the-art
methods. The C++ source code of our algorithm can be accessed
from the website: https://github.com/ivpml84079/Point-cloud-color-correction.
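To make the KBI idea concrete, a hypothetical K-nearest-neighbor bilateral
blend is sketched below: each target point takes a weighted average of its k
nearest source colors, with weights driven by spatial distance and color
similarity. The exact weighting used by the paper's KBI is not reproduced, and
the kernel bandwidths are placeholders:

    import numpy as np
    from scipy.spatial import cKDTree

    def knn_bilateral_color(target_xyz, target_rgb, source_xyz, source_rgb,
                            k=8, sigma_s=0.05, sigma_c=0.1):
        tree = cKDTree(source_xyz)
        dist, idx = tree.query(target_xyz, k=k)            # (n, k)
        neigh_rgb = source_rgb[idx]                        # (n, k, 3)
        w_spatial = np.exp(-(dist ** 2) / (2 * sigma_s ** 2))
        color_diff = np.linalg.norm(neigh_rgb - target_rgb[:, None, :], axis=2)
        w_color = np.exp(-(color_diff ** 2) / (2 * sigma_c ** 2))
        w = w_spatial * w_color + 1e-12
        w /= w.sum(axis=1, keepdims=True)
        return (w[..., None] * neigh_rgb).sum(axis=1)      # corrected colors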
☆ Self-Supervised Moving Object Segmentation of Sparse and Noisy Radar Point Clouds ITSC 2025
Leon Schwarzer, Matthias Zeller, Daniel Casado Herraez, Simon Dierl, Michael Heidingsfeld, Cyrill Stachniss
Moving object segmentation is a crucial task for safe and reliable autonomous
mobile systems like self-driving cars, improving the reliability and robustness
of subsequent tasks like SLAM or path planning. While the segmentation of
camera or LiDAR data is widely researched and achieves great results, it often
introduces an increased latency by requiring the accumulation of temporal
sequences to gain the necessary temporal context. Radar sensors overcome this
problem with their ability to provide a direct measurement of a point's Doppler
velocity, which can be exploited for single-scan moving object segmentation.
However, radar point clouds are often sparse and noisy, making data annotation
for use in supervised learning very tedious, time-consuming, and
cost-intensive. To overcome this problem, we address the task of
self-supervised moving object segmentation of sparse and noisy radar point
clouds. We follow a two-step approach of contrastive self-supervised
representation learning with subsequent supervised fine-tuning using limited
amounts of annotated data. We propose a novel clustering-based contrastive loss
function with cluster refinement based on dynamic points removal to pretrain
the network to produce motion-aware representations of the radar data. Our
method improves label efficiency after fine-tuning, effectively boosting
state-of-the-art performance by self-supervised pretraining.
comment: Accepted for publication at IEEE International Conference on
Intelligent Transportation Systems (ITSC 2025), 8 pages, 3 figures
☆ RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning
Jiahe Song, Chuang Wang, Bowen Jiang, Yinfan Wang, Hao Zheng, Xingjian Wei, Chengjin Liu, Junyuan Gao, Yubin Wang, Lijun Wu, Jiang Wu, Qian Yu, Conghui He
Large-scale chemical reaction datasets are crucial for AI research in
chemistry. However, existing chemical reaction data often exist as images
within papers, making them not machine-readable and unusable for training
machine learning models. In response to this challenge, we propose the
RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP).
Our framework reformulates the traditional coordinate prediction driven parsing
process into an image captioning problem, which Large Vision-Language Models
(LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as
Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector,
MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the
input image. This turns the downstream parsing into a natural-language
description problem. Extensive experiments show that the BIVP strategy
significantly improves structural extraction quality while simplifying model
design. We further construct the RxnCaption-11k dataset, an order of magnitude
larger than prior real-world literature benchmarks, with a balanced test subset
across four layout archetypes. Experiments demonstrate that RxnCaption-VL
achieves state-of-the-art performance on multiple metrics. We believe our
method, dataset, and models will advance structured information extraction from
chemical literature and catalyze broader AI applications in chemistry. We will
release data, models, and code on GitHub.
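The BIVP step amounts to rendering detector outputs onto the input image
before captioning. A minimal sketch with Pillow is given below, assuming boxes
come from a molecular detector such as MolYOLO; the drawing style (colors,
index placement) is illustrative only:

    from PIL import Image, ImageDraw

    def draw_bivp_prompt(image_path, boxes, out_path="prompted.png"):
        # boxes: list of (x0, y0, x1, y1) molecule bounding boxes.
        img = Image.open(image_path).convert("RGB")
        draw = ImageDraw.Draw(img)
        for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
            draw.rectangle([x0, y0, x1, y1], outline=(255, 0, 0), width=3)
            draw.text((x0 + 4, y0 + 4), str(i), fill=(255, 0, 0))
        img.save(out_path)
        return out_path  # image passed to the LVLM for captioning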
☆ CoCoVa: Chain of Continuous Vision-Language Thought for Latent Space Reasoning
In human cognition, there exist numerous thought processes that are tacit and
beyond verbal expression, enabling us to understand and interact with the world
in multiple ways. However, contemporary Vision-Language Models (VLMs) remain
constrained to reasoning within the discrete and rigid space of linguistic
tokens, thereby bottlenecking the rich, high-dimensional nature of visual
perception. To bridge this gap, we propose CoCoVa (Chain of Continuous
Vision-Language Thought), a novel framework for vision-language model that
leverages continuous cross-modal reasoning for diverse vision-language tasks.
The core of CoCoVa is an iterative reasoning cycle, where a novel Latent
Q-Former (LQ-Former) acts as a dynamic reasoning engine, iteratively refining a
chain of latent thought vectors through cross-modal fusion. To focus this
process, a token selection mechanism dynamically identifies salient visual
regions, mimicking attentional focus. To ensure these latent thoughts remain
grounded, we train the model with a multi-task objective that combines
contrastive learning and diffusion-based reconstruction, enforcing alignment
between latent representations and both visual and textual modalities.
Evaluations show CoCoVa improves accuracy and token efficiency over strong
baselines. With a 1.5B backbone, it competes with or surpasses larger 7B-9B
models on almost all benchmarks. When scaled to 7B LLM backbones, it remains
competitive with state-of-the-art models. Qualitative analysis validates that
learned latent space captures interpretable and structured reasoning patterns,
highlighting the potential of CoCoVa to bridge the representational gap between
discrete language processing and the continuous nature of visual understanding.
☆ M3PD Dataset: Dual-view Photoplethysmography (PPG) Using Front-and-rear Cameras of Smartphones in Lab and Clinical Settings
Jiankai Tang, Tao Zhang, Jia Li, Yiru Zhang, Mingyu Zhang, Kegang Wang, Yuming Hao, Bolin Wang, Haiyang Li, Xingyao Wang, Yuanchun Shi, Yuntao Wang, Sichong Qian
Portable physiological monitoring is essential for early detection and
management of cardiovascular disease, but current methods often require
specialized equipment that limits accessibility or impose impractical postures
that patients cannot maintain. Video-based photoplethysmography on smartphones
offers a convenient noninvasive alternative, yet it still faces reliability
challenges caused by motion artifacts, lighting variations, and single-view
constraints. Few studies have demonstrated reliable application to
cardiovascular patients, and no widely used open datasets exist for assessing
cross-device accuracy. To address these limitations, we introduce the M3PD
dataset, the first publicly available dual-view mobile photoplethysmography
dataset, comprising synchronized facial and fingertip videos captured
simultaneously via front and rear smartphone cameras from 60 participants
(including 47 cardiovascular patients). Building on this dual-view setting, we
further propose F3Mamba, which fuses the facial and fingertip views through
Mamba-based temporal modeling. The model reduces heart-rate error by 21.9 to
30.2 percent over existing single-view baselines while improving robustness in
challenging real-world scenarios. Data and code:
https://github.com/Health-HCI-Group/F3Mamba.
☆ GAFD-CC: Global-Aware Feature Decoupling with Confidence Calibration for OOD Detection
Out-of-distribution (OOD) detection is paramount to ensuring the reliability
and robustness of learning models in real-world applications. Existing post-hoc
OOD detection methods detect OOD samples by leveraging their features and
logits information without retraining. However, they often overlook the
inherent correlation between features and logits, which is crucial for
effective OOD detection. To address this limitation, we propose Global-Aware
Feature Decoupling with Confidence Calibration (GAFD-CC). GAFD-CC aims to
refine decision boundaries and increase discriminative performance. Firstly, it
performs global-aware feature decoupling guided by classification weights. This
involves aligning features with the direction of global classification weights
to decouple them. From this, GAFD-CC extracts two types of critical
information: positively correlated features that promote in-distribution
(ID)/OOD boundary refinement and negatively correlated features that suppress
false positives and tighten these boundaries. Secondly, it adaptively fuses
these decoupled features with multi-scale logit-based confidence for
comprehensive and robust OOD detection. Extensive experiments on large-scale
benchmarks demonstrate GAFD-CC's competitive performance and strong
generalization ability compared to those of state-of-the-art methods.
☆ Cycle-Sync: Robust Global Camera Pose Estimation through Enhanced Cycle-Consistent Synchronization NeurIPS 2025
We introduce Cycle-Sync, a robust and global framework for estimating camera
poses (both rotations and locations). Our core innovation is a location solver
that adapts message-passing least squares (MPLS) -- originally developed for
group synchronization -- to camera location estimation. We modify MPLS to
emphasize cycle-consistent information, redefine cycle consistencies using
estimated distances from previous iterations, and incorporate a Welsch-type
robust loss. We establish the strongest known deterministic exact-recovery
guarantee for camera location estimation, showing that cycle consistency alone
-- without access to inter-camera distances -- suffices to achieve the lowest
sample complexity currently known. To further enhance robustness, we introduce
a plug-and-play outlier rejection module inspired by robust subspace recovery,
and we fully integrate cycle consistency into MPLS for rotation
synchronization. Our global approach avoids the need for bundle adjustment.
Experiments on synthetic and real datasets show that Cycle-Sync consistently
outperforms leading pose estimators, including full structure-from-motion
pipelines with bundle adjustment.
comment: NeurIPS 2025 spotlight paper
☆ 3D Point Cloud Object Detection on Edge Devices for Split Computing
The field of autonomous driving technology is rapidly advancing, with deep
learning being a key component. Particularly in the field of sensing, 3D point
cloud data collected by LiDAR is utilized to run deep neural network models for
3D object detection. However, these state-of-the-art models are complex,
leading to longer processing times and increased power consumption on edge
devices. The objective of this study is to address these issues by leveraging
Split Computing, a distributed machine learning inference method. Split
Computing aims to lessen the computational burden on edge devices, thereby
reducing processing time and power consumption. Furthermore, it minimizes the
risk of data breaches by only transmitting intermediate data from the deep
neural network model. Experimental results show that splitting after
voxelization reduces the inference time by 70.8% and the edge device execution
time by 90.0%. When splitting within the network, the inference time is reduced
by up to 57.1%, and the edge device execution time is reduced by up to 69.5%.
comment: 6 pages. This version includes minor lstlisting configuration
adjustments for successful compilation. No changes to content or layout.
Originally published at ACM/IEEE RAGE 2024
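The split-after-voxelization setting can be pictured as a two-stage pipeline:
the edge device only voxelizes the LiDAR points and ships the intermediate
data, while the heavy detection network runs on the server. The sketch below
is schematic; send_to_server stands for whatever transport and server-side
detector a deployment actually uses:

    import numpy as np

    def voxelize(points, voxel_size=0.2):
        # points: (N, 4) array of x, y, z, intensity from the LiDAR scan.
        coords = np.floor(points[:, :3] / voxel_size).astype(np.int32)
        return coords, points  # intermediate data, not raw sensor frames

    def run_split_inference(points, send_to_server):
        # send_to_server: any transport (socket, gRPC, ...) that forwards the
        # intermediate data and returns the server's detections.
        coords, feats = voxelize(points)            # edge-side computation
        return send_to_server({"coords": coords, "feats": feats})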
☆ Link prediction Graph Neural Networks for structure recognition of Handwritten Mathematical Expressions ICDAR2025
We propose a Graph Neural Network (GNN)-based approach for Handwritten
Mathematical Expression (HME) recognition by modeling HMEs as graphs, where
nodes represent symbols and edges capture spatial dependencies. A deep BLSTM
network is used for symbol segmentation, recognition, and spatial relation
classification, forming an initial primitive graph. A 2D-CFG parser then
generates all possible spatial relations, while the GNN-based link prediction
model refines the structure by removing unnecessary connections, ultimately
forming the Symbol Label Graph. Experimental results demonstrate the
effectiveness of our approach, showing promising performance in HME structure
recognition.
comment: accepted for ICDAR2025-WML
☆ SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
Fangxun Shu, Yongjie Ye, Yue Liao, Zijian Kang, Weijie Yin, Jiacong Wang, Xiao Liang, Shuicheng Yan, Chao Feng
We introduce SAIL-RL, a reinforcement learning (RL) post-training framework
that enhances the reasoning capabilities of multimodal large language models
(MLLMs) by teaching them when and how to think. Existing approaches are limited
by outcome-only supervision, which rewards correct answers without ensuring
sound reasoning, and by uniform thinking strategies, which often lead to
overthinking on simple tasks and underthinking on complex ones. SAIL-RL
addresses these challenges with a dual reward system: the Thinking Reward,
which evaluates reasoning quality through factual grounding, logical coherence,
and answer consistency, and the Judging Reward, which adaptively determines
whether deep reasoning or direct answering is appropriate. Experiments on the
state-of-the-art SAIL-VL2 show that SAIL-RL improves reasoning and multimodal
understanding benchmarks at both 4B and 8B scales, achieving competitive
performance against commercial closed-source models such as GPT-4o, and
substantially reduces hallucinations, establishing it as a principled framework
for building more reliable and adaptive MLLMs. The code will be available at
https://github.com/BytedanceDouyinContent/SAIL-RL.
☆ Are Euler angles a useful rotation parameterisation for pose estimation with Normalizing Flows? BMVC 2025
Giorgos Sfikas, Konstantina Nikolaidou, Foteini Papadopoulou, George Retsinas, Anastasios L. Kesidis
Object pose estimation is a task that is of central importance in 3D Computer
Vision. Given a target image and a canonical pose, a single point estimate may
very often be sufficient; however, a probabilistic pose output is related to a
number of benefits when pose is not unambiguous due to sensor and projection
constraints or inherent object symmetries. With this paper, we explore the
usefulness of using the well-known Euler angles parameterisation as a basis for
a Normalizing Flows model for pose estimation. Isomorphic to spatial rotation,
3D pose has been parameterized in a number of ways, either in or out of the
context of parameter estimation. We explore the idea that Euler angles, despite
their shortcomings, may lead to useful models in a number of aspects, compared
to a model built on a more complex parameterisation.
comment: BMVC 2025 workshop proceedings (Smart Cameras for Smarter Autonomous
Vehicles & Robots)
☆ Medical Report Generation: A Hierarchical Task Structure-Based Cross-Modal Causal Intervention Framework
Medical Report Generation (MRG) is a key part of modern medical diagnostics,
as it automatically generates reports from radiological images to reduce
radiologists' burden. However, reliable MRG models for lesion description face
three main challenges: insufficient domain knowledge understanding, poor
text-visual entity embedding alignment, and spurious correlations from
cross-modal biases. Previous work only addresses single challenges, while this
paper tackles all three via a novel hierarchical task decomposition approach,
proposing the HTSC-CIF framework. HTSC-CIF classifies the three challenges into
low-, mid-, and high-level tasks: 1) Low-level: align medical entity features
with spatial locations to enhance domain knowledge for visual encoders; 2)
Mid-level: use Prefix Language Modeling (text) and Masked Image Modeling
(images) to boost cross-modal alignment via mutual guidance; 3) High-level: a
cross-modal causal intervention module (via front-door intervention) to reduce
confounders and improve interpretability. Extensive experiments confirm
HTSC-CIF's effectiveness, significantly outperforming state-of-the-art (SOTA)
MRG methods. Code will be made public upon paper acceptance.
☆ Monocular absolute depth estimation from endoscopy via domain-invariant feature learning and latent consistency
Hao Li, Daiwei Lu, Jesse d'Almeida, Dilara Isik, Ehsan Khodapanah Aghdam, Nick DiSanto, Ayberk Acar, Susheela Sharma, Jie Ying Wu, Robert J. Webster III, Ipek Oguz
Monocular depth estimation (MDE) is a critical task to guide autonomous
medical robots. However, obtaining absolute (metric) depth from an endoscopy
camera in surgical scenes is difficult, which limits supervised learning of
depth on real endoscopic images. Current image-level unsupervised domain
adaptation methods translate synthetic images with known depth maps into the
style of real endoscopic frames and train depth networks using these translated
images with their corresponding depth maps. However, a domain gap often remains
between real and translated synthetic images. In this paper, we present a
latent feature alignment method to improve absolute depth estimation by
reducing this domain gap in the context of endoscopic videos of the central
airway. Our methods are agnostic to the image translation process and focus on
the depth estimation itself. Specifically, the depth network takes translated
synthetic and real endoscopic frames as input and learns latent
domain-invariant features via adversarial learning and directional feature
consistency. The evaluation is conducted on endoscopic videos of central airway
phantoms with manually aligned absolute depth maps. Compared to
state-of-the-art MDE methods, our approach achieves superior performance on
both absolute and relative depth metrics, and consistently improves results
across various backbones and pretrained weights. Our code is available at
https://github.com/MedICL-VU/MDE.
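One standard recipe for the adversarial part of such latent alignment is a
gradient reversal layer feeding a domain discriminator; the sketch below shows
that generic pattern in PyTorch and is not the paper's implementation (its
directional feature consistency term is omitted):

    import torch
    from torch import nn
    from torch.autograd import Function

    class GradReverse(Function):
        # Identity in the forward pass, negated gradient in the backward pass.
        @staticmethod
        def forward(ctx, x, lam):
            ctx.lam = lam
            return x.view_as(x)

        @staticmethod
        def backward(ctx, grad_output):
            return -ctx.lam * grad_output, None

    class DomainDiscriminator(nn.Module):
        def __init__(self, feat_dim=256):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 1))

        def forward(self, feat, lam=1.0):
            # feat: latent depth-network features from synthetic or real
            # frames; the reversed gradient pushes the encoder to make the two
            # domains indistinguishable.
            return self.net(GradReverse.apply(feat, lam))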
☆ Collaborative Attention and Consistent-Guided Fusion of MRI and PET for Alzheimer's Disease Diagnosis
Alzheimer's disease (AD) is the most prevalent form of dementia, and its
early diagnosis is essential for slowing disease progression. Recent studies on
multimodal neuroimaging fusion using MRI and PET have achieved promising
results by integrating multi-scale complementary features. However, most
existing approaches primarily emphasize cross-modal complementarity while
overlooking the diagnostic importance of modality-specific features. In
addition, the inherent distributional differences between modalities often lead
to biased and noisy representations, degrading classification performance. To
address these challenges, we propose a Collaborative Attention and
Consistent-Guided Fusion framework for MRI and PET based AD diagnosis. The
proposed model introduces a learnable parameter representation (LPR) block to
compensate for missing modality information, followed by a shared encoder and
modality-independent encoders to preserve both shared and specific
representations. Furthermore, a consistency-guided mechanism is employed to
explicitly align the latent distributions across modalities. Experimental
results on the ADNI dataset demonstrate that our method achieves superior
diagnostic performance compared with existing fusion strategies.
☆ Can Foundation Models Revolutionize Mobile AR Sparse Sensing?
Mobile sensing systems have long faced a fundamental trade-off between
sensing quality and efficiency due to constraints in computation, power, and
other resources. Sparse sensing, which aims to acquire and process only a
subset of sensor data, has been a key strategy for maintaining performance
under such constraints. However, existing sparse sensing methods often suffer
from reduced accuracy, as missing information across space and time introduces
uncertainty into many sensing systems. In this work, we investigate whether
foundation models can change the landscape of mobile sparse sensing. Using
real-world mobile AR data, our evaluations demonstrate that foundation models
offer significant improvements in geometry-aware image warping, a central
technique for enabling accurate reuse of cross-frame information. Furthermore,
our study demonstrates the scalability of foundation model-based sparse sensing
and shows its leading performance in 3D scene reconstruction. Collectively, our
study reveals critical aspects of the promises and the open challenges of
integrating foundation models into mobile sparse sensing systems.
☆ High-Resolution Magnetic Particle Imaging System Matrix Recovery Using a Vision Transformer with Residual Feature Network
This study presents a hybrid deep learning framework, the Vision Transformer
with Residual Feature Network (VRF-Net), for recovering high-resolution system
matrices in Magnetic Particle Imaging (MPI). MPI resolution often suffers from
downsampling and coil sensitivity variations. VRF-Net addresses these
challenges by combining transformer-based global attention with residual
convolutional refinement, enabling recovery of both large-scale structures and
fine details. To reflect realistic MPI conditions, the system matrix is
degraded using a dual-stage downsampling strategy. Training employed
paired-image super-resolution on the public Open MPI dataset and a simulated
dataset incorporating variable coil sensitivity profiles. For system matrix
recovery on the Open MPI dataset, VRF-Net achieved nRMSE = 0.403, pSNR = 39.08
dB, and SSIM = 0.835 at 2x scaling, and maintained strong performance even at
the challenging 8x scale (pSNR = 31.06 dB, SSIM = 0.717). For the simulated
dataset, VRF-Net achieved nRMSE = 4.44, pSNR = 28.52 dB, and SSIM = 0.771 at 2x
scaling, with stable performance at higher scales. On average, it reduced nRMSE
by 88.2%, increased pSNR by 44.7%, and improved SSIM by 34.3% over
interpolation and CNN-based methods. In image reconstruction of Open MPI
phantoms, VRF-Net further reduced reconstruction error to nRMSE = 1.79 at 2x
scaling, while preserving structural fidelity (pSNR = 41.58 dB, SSIM = 0.960),
outperforming existing methods. These findings demonstrate that VRF-Net enables
sharper, artifact-free system matrix recovery and robust image reconstruction
across multiple scales, offering a promising direction for future in vivo
applications.
☆ Estimation of Segmental Longitudinal Strain in Transesophageal Echocardiography by Deep Learning
Anders Austlid Taskén, Thierry Judge, Erik Andreas Rye Berg, Jinyang Yu, Bjørnar Grenne, Frank Lindseth, Svend Aakhus, Pierre-Marc Jodoin, Nicolas Duchateau, Olivier Bernard, Gabriel Kiss
Segmental longitudinal strain (SLS) of the left ventricle (LV) is an
important prognostic indicator for evaluating regional LV dysfunction, in
particular for diagnosing and managing myocardial ischemia. Current techniques
for strain estimation require significant manual intervention and expertise,
limiting their efficiency and making them too resource-intensive for monitoring
purposes. This study introduces the first automated pipeline, autoStrain, for
SLS estimation in transesophageal echocardiography (TEE) using deep learning
(DL) methods for motion estimation. We present a comparative analysis of two DL
approaches: TeeFlow, based on the RAFT optical flow model for dense
frame-to-frame predictions, and TeeTracker, based on the CoTracker point
trajectory model for sparse long-sequence predictions.
As ground truth motion data from real echocardiographic sequences are difficult
to obtain, we took advantage of a unique simulation pipeline (SIMUS) to
generate a highly realistic synthetic TEE (synTEE) dataset of 80 patients with
ground truth myocardial motion to train and evaluate both models. Our
evaluation shows that TeeTracker outperforms TeeFlow in accuracy, achieving a
mean distance error in motion estimation of 0.65 mm on a synTEE test dataset.
Clinical validation on 16 patients further demonstrated that SLS estimation
with our autoStrain pipeline aligned with clinical references, achieving a mean
difference (95% limits of agreement) of 1.09% (-8.90% to 11.09%).
Incorporation of simulated ischemia in the synTEE data improved the accuracy of
the models in quantifying abnormal deformation. Our findings indicate that
integrating AI-driven motion estimation with TEE can significantly enhance the
precision and efficiency of cardiac function assessment in clinical settings.
comment: 13 pages, IEEE Journal of Biomedical and Health Informatics
☆ Object-Centric 3D Gaussian Splatting for Strawberry Plant Reconstruction and Phenotyping
Strawberries are among the most economically significant fruits in the United
States, generating over $2 billion in annual farm-gate sales and accounting for
approximately 13% of the total fruit production value. Plant phenotyping plays
a vital role in selecting superior cultivars by characterizing plant traits
such as morphology, canopy structure, and growth dynamics. However, traditional
plant phenotyping methods are time-consuming, labor-intensive, and often
destructive. Recently, neural rendering techniques, notably Neural Radiance
Fields (NeRF) and 3D Gaussian Splatting (3DGS), have emerged as powerful
frameworks for high-fidelity 3D reconstruction. By capturing a sequence of
multi-view images or videos around a target plant, these methods enable
non-destructive reconstruction of complex plant architectures. Despite their
promise, most current applications of 3DGS in agricultural domains reconstruct
the entire scene, including background elements, which introduces noise,
increases computational costs, and complicates downstream trait analysis. To
address this limitation, we propose a novel object-centric 3D reconstruction
framework incorporating a preprocessing pipeline that leverages the Segment
Anything Model v2 (SAM-2) and alpha channel background masking to achieve clean
strawberry plant reconstructions. This approach produces more accurate
geometric representations while substantially reducing computational time. With
a background-free reconstruction, our algorithm can automatically estimate
important plant traits, such as plant height and canopy width, using DBSCAN
clustering and Principal Component Analysis (PCA). Experimental results show
that our method outperforms conventional pipelines in both accuracy and
efficiency, offering a scalable and non-destructive solution for strawberry
plant phenotyping.
comment: 11 pages, 4 figures, 3 tables
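The trait-estimation step can be illustrated with a short sketch on a
background-free plant point cloud, assuming z is the vertical axis; the DBSCAN
parameters are placeholders rather than the values used in the paper:

    import numpy as np
    from sklearn.cluster import DBSCAN
    from sklearn.decomposition import PCA

    def estimate_traits(points, eps=0.02, min_samples=30):
        # points: (N, 3) plant point cloud after background masking.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(points)
        keep = labels == np.bincount(labels[labels >= 0]).argmax()
        plant = points[keep]                    # largest cluster = plant body
        plant_height = plant[:, 2].max() - plant[:, 2].min()
        # Canopy width: extent along the first principal axis in the x-y plane.
        xy = PCA(n_components=2).fit_transform(plant[:, :2])
        canopy_width = xy[:, 0].max() - xy[:, 0].min()
        return plant_height, canopy_width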
☆ Language-Enhanced Generative Modeling for PET Synthesis from MRI and Blood Biomarkers
Background: Alzheimer's disease (AD) diagnosis heavily relies on amyloid-beta
positron emission tomography (Abeta-PET), which is limited by high cost and
limited accessibility. This study explores whether Abeta-PET spatial patterns
can be predicted from blood-based biomarkers (BBMs) and MRI scans. Methods: We
collected Abeta-PET images, T1-weighted MRI scans, and BBMs from 566
participants. A language-enhanced generative model, driven by a large language
model (LLM) and multimodal information fusion, was developed to synthesize PET
images. Synthesized images were evaluated for image quality, diagnostic
consistency, and clinical applicability within a fully automated diagnostic
pipeline. Findings: The synthetic PET images closely resemble real PET scans in
both structural details (SSIM = 0.920 +/- 0.003) and regional patterns
(Pearson's r = 0.955 +/- 0.007). Diagnostic outcomes using synthetic PET show
high agreement with real PET-based diagnoses (accuracy = 0.80). Using synthetic
PET, we developed a fully automatic AD diagnostic pipeline integrating PET
synthesis and classification. The synthetic PET-based model (AUC = 0.78)
outperforms T1-based (AUC = 0.68) and BBM-based (AUC = 0.73) models, while
combining synthetic PET and BBMs further improved performance (AUC = 0.79).
Ablation analysis supports the advantages of LLM integration and prompt
engineering. Interpretation: Our language-enhanced generative model synthesizes
realistic PET images, enhancing the utility of MRI and BBMs for Abeta spatial
pattern assessment and improving the diagnostic workflow for Alzheimer's
disease.
comment: 31 pages, 8 figures
☆ OmniField: Conditioned Neural Fields for Robust Multimodal Spatiotemporal Learning
Multimodal spatiotemporal learning on real-world experimental data is
constrained by two challenges: within-modality measurements are sparse,
irregular, and noisy (QA/QC artifacts) but cross-modally correlated; the set of
available modalities varies across space and time, shrinking the usable record
unless models can adapt to arbitrary subsets at train and test time. We propose
OmniField, a continuity-aware framework that learns a continuous neural field
conditioned on available modalities and iteratively fuses cross-modal context.
A multimodal crosstalk block architecture paired with iterative cross-modal
refinement aligns signals prior to the decoder, enabling unified
reconstruction, interpolation, forecasting, and cross-modal prediction without
gridding or surrogate preprocessing. Extensive evaluations show that OmniField
consistently outperforms eight strong multimodal spatiotemporal baselines.
Under heavy simulated sensor noise, performance remains close to clean-input
levels, highlighting robustness to corrupted measurements.
comment: 25 pages, 12 figures, 8 tables
☆ MM-UNet: Morph Mamba U-shaped Convolutional Networks for Retinal Vessel Segmentation
Jiawen Liu, Yuanbo Zeng, Jiaming Liang, Yizhen Yang, Yiheng Zhang, Enhui Cai, Xiaoqi Sheng, Hongmin Cai
Accurate detection of retinal vessels plays a critical role in reflecting a
wide range of health status indicators in the clinical diagnosis of ocular
diseases. Recently, advances in deep learning have led to a surge in retinal
vessel segmentation methods, which have significantly contributed to the
quantitative analysis of vascular morphology. However, retinal vasculature
differs significantly from conventional segmentation targets in that it
consists of extremely thin and branching structures, whose global morphology
varies greatly across images. These characteristics continue to pose challenges
to segmentation precision and robustness. To address these issues, we propose
MM-UNet, a novel architecture tailored for efficient retinal vessel
segmentation. The model incorporates Morph Mamba Convolution layers, which
replace pointwise convolutions to enhance branching topological perception
through morph, state-aware feature sampling. Additionally, Reverse Selective
State Guidance modules integrate reverse guidance theory with state-space
modeling to improve geometric boundary awareness and decoding efficiency.
Extensive experiments conducted on two public retinal vessel segmentation
datasets demonstrate the superior performance of the proposed method in
segmentation accuracy. Compared to the existing approaches, MM-UNet achieves
F1-score gains of 1.64% on DRIVE and 1.25% on STARE, demonstrating its
effectiveness and advancement. The project code is public via
https://github.com/liujiawen-jpg/MM-UNet.
comment: This paper was accepted by IEEE BIBM 2025 conference
☆ Pinpointing Trigger Moment for Grounded Video QA: Enhancing Spatio-temporal Grounding in Multimodal Large Language Models ICCV2025
In this technical report, we introduce a framework to address Grounded Video
Question Answering (GVQA) task for the ICCV 2025 Perception Test Challenge. The
GVQA task demands robust multimodal models capable of complex reasoning over
video content, grounding the resulting answers visually, and tracking the
referenced objects temporally. To achieve this capability, our proposed
approach decomposes the GVQA task into a three-stage pipeline: (1) Video
Reasoning \& QA, (2) Spatio-temporal Grounding and (3) Tracking. Our key
contribution is the introduction of a trigger moment, derived from our proposed
CORTEX prompt, which pinpoints the single most visible frame of a target object
to serve as a robust anchor for grounding and tracking. With this approach, we
achieve a HOTA score of 0.4968, which marks a significant improvement over the
previous year's winning score of 0.2704 on the GVQA task.
comment: 1st place winner of Grounded Videoqa track at the ICCV2025 Perception
Test
☆ Autobiasing Event Cameras for Flickering Mitigation
Understanding and mitigating flicker effects caused by rapid variations in
light intensity is critical for enhancing the performance of event cameras in
diverse environments. This paper introduces an innovative autonomous mechanism
for tuning the biases of event cameras, effectively addressing flicker across a
wide frequency range (25 Hz to 500 Hz). Unlike traditional methods that rely on
additional hardware or software for flicker filtering, our approach leverages
the event camera's inherent bias settings. Utilizing a simple Convolutional
Neural Network (CNN), the system identifies instances of flicker in the spatial
domain and dynamically adjusts specific biases to minimize its impact. The
efficacy of this autobiasing system was robustly tested using a face detector
framework under both well-lit and low-light conditions, as well as across
various frequencies. The results demonstrated significant improvements:
enhanced YOLO confidence metrics for face detection, and an increased
percentage of frames capturing detected faces. Moreover, the average gradient,
which serves as an indicator of flicker presence through edge detection,
decreased by 38.2 percent in well-lit conditions and by 53.6 percent in
low-light conditions. These findings underscore the potential of our approach
to significantly improve the functionality of event cameras in a range of
adverse lighting scenarios.
☆ Fast Measuring Pavement Crack Width by Cascading Principal Component Analysis
Accurate quantification of pavement crack width plays a pivotal role in
assessing structural integrity and guiding maintenance interventions. However,
achieving precise crack width measurements presents significant challenges due
to: (1) the complex, non-uniform morphology of crack boundaries, which limits
the efficacy of conventional approaches, and (2) the demand for rapid
measurement capabilities from arbitrary pixel locations to facilitate
comprehensive pavement condition evaluation. To overcome these limitations,
this study introduces a cascaded framework integrating Principal Component
Analysis (PCA) and Robust PCA (RPCA) for efficient crack width extraction from
digital images. The proposed methodology comprises three sequential stages: (1)
initial crack segmentation using established detection algorithms to generate a
binary representation, (2) determination of the primary orientation axis for
quasi-parallel cracks through PCA, and (3) extraction of the Main Propagation
Axis (MPA) for irregular crack geometries using RPCA. Comprehensive evaluations
were conducted across three publicly available datasets, demonstrating that the
proposed approach achieves superior performance in both computational
efficiency and measurement accuracy compared to existing state-of-the-art
techniques.
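The PCA stage of the cascade reduces to fitting the principal orientation of
the segmented crack pixels; crack width can then be sampled perpendicular to
that axis. A minimal sketch (the RPCA handling of irregular geometries is not
shown):

    import numpy as np

    def crack_main_axis(binary_mask):
        # binary_mask: 2D array, nonzero where the crack was segmented.
        ys, xs = np.nonzero(binary_mask)
        pts = np.stack([xs, ys], axis=1).astype(float)
        pts -= pts.mean(axis=0)
        eigvals, eigvecs = np.linalg.eigh(np.cov(pts.T))
        main_axis = eigvecs[:, np.argmax(eigvals)]        # primary orientation
        normal = np.array([-main_axis[1], main_axis[0]])  # width direction
        return main_axis, normal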
☆ From Instance Segmentation to 3D Growth Trajectory Reconstruction in Planktonic Foraminifera
Planktonic foraminifera, marine protists characterized by their intricate
chambered shells, serve as valuable indicators of past and present
environmental conditions. Understanding their chamber growth trajectory
provides crucial insights into organismal development and ecological adaptation
under changing environments. However, automated tracing of chamber growth from
imaging data remains largely unexplored, with existing approaches relying
heavily on manual segmentation of each chamber, which is time-consuming and
subjective. In this study, we propose an end-to-end pipeline that integrates
instance segmentation, a computer vision technique not extensively explored in
foraminifera, with a dedicated chamber ordering algorithm to automatically
reconstruct three-dimensional growth trajectories from high-resolution computed
tomography scans. We quantitatively and qualitatively evaluate multiple
instance segmentation methods, each optimized for distinct spatial features of
the chambers, and examine their downstream influence on growth-order
reconstruction accuracy. Experimental results on expert-annotated datasets
demonstrate that the proposed pipeline substantially reduces manual effort
while maintaining biologically meaningful accuracy. Although segmentation
models exhibit under-segmentation in smaller chambers due to reduced voxel
fidelity and subtle inter-chamber connectivity, the chamber-ordering algorithm
remains robust, achieving consistent reconstruction of developmental
trajectories even under partial segmentation. This work provides the first
fully automated and reproducible pipeline for digital foraminiferal growth
analysis, establishing a foundation for large-scale, data-driven ecological
studies.
♻ ☆ GS-Verse: Mesh-based Gaussian Splatting for Physics-aware Interaction in Virtual Reality
Anastasiya Pechko, Piotr Borycki, Joanna Waczyńska, Daniel Barczyk, Agata Szymańska, Sławomir Tadeja, Przemysław Spurek
As the demand for immersive 3D content grows, the need for intuitive and
efficient interaction methods becomes paramount. Current techniques for
physically manipulating 3D content within Virtual Reality (VR) often face
significant limitations, including reliance on engineering-intensive processes
and simplified geometric representations, such as tetrahedral cages, which can
compromise visual fidelity and physical accuracy. In this paper, we introduce
GS-Verse (Gaussian Splatting for Virtual Environment Rendering and Scene
Editing), a novel method designed to overcome these challenges by directly
integrating an object's mesh with a Gaussian Splatting (GS) representation. Our
approach enables more precise surface approximation, leading to highly
realistic deformations and interactions. By leveraging existing 3D mesh assets,
GS-Verse facilitates seamless content reuse and simplifies the development
workflow. Moreover, our system is designed to be physics-engine-agnostic,
granting developers robust deployment flexibility. This versatile architecture
delivers a highly realistic, adaptable, and intuitive approach to interactive
3D manipulation. We rigorously validate our method against the current
state-of-the-art technique that couples VR with GS in a comparative user study
involving 18 participants. Specifically, we demonstrate that our approach is
statistically significantly better for physics-aware stretching manipulation
and is also more consistent in other physics-based manipulations like twisting
and shaking. Further evaluation across various interactions and scenes confirms
that our method consistently delivers high and reliable performance, showing
its potential as a plausible alternative to existing methods.
♻ ☆ DIsoN: Decentralized Isolation Networks for Out-of-Distribution Detection in Medical Imaging NeurIPS 2025
Safe deployment of machine learning (ML) models in safety-critical domains
such as medical imaging requires detecting inputs with characteristics not seen
during training, known as out-of-distribution (OOD) detection, to prevent
unreliable predictions. Effective OOD detection after deployment could benefit
from access to the training data, enabling direct comparison between test
samples and the training data distribution to identify differences.
State-of-the-art OOD detection methods, however, either discard the training
data after deployment or assume that test samples and training data are
centrally stored together, an assumption that rarely holds in real-world
settings. This is because shipping the training data with the deployed model is
usually impossible due to the size of training databases, as well as
proprietary or privacy constraints. We introduce the Isolation Network, an OOD
detection framework that quantifies the difficulty of separating a target test
sample from the training data by solving a binary classification task. We then
propose Decentralized Isolation Networks (DIsoN), which enables the comparison
of training and test data when data-sharing is impossible, by exchanging only
model parameters between the remote computational nodes of training and
deployment. We further extend DIsoN with class-conditioning, comparing a target
sample solely with training data of its predicted class. We evaluate DIsoN on
four medical imaging datasets (dermatology, chest X-ray, breast ultrasound,
histopathology) across 12 OOD detection tasks. DIsoN performs favorably against
existing methods while respecting data-privacy. This decentralized OOD
detection framework opens the way for a new type of service that ML developers
could provide along with their models: providing remote, secure utilization of
their training data for OOD detection services. Code:
https://github.com/FelixWag/DIsoN
comment: Accepted at NeurIPS 2025
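The isolation idea, stripped of its decentralized protocol, can be caricatured
as fitting a binary classifier that separates a single test sample from (a
subset of) the training features and scoring how easy that separation is; the
sketch below is a simplified, centralized stand-in, not DIsoN itself:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def isolation_score(test_feat, train_feats, n_ref=512, seed=0):
        rng = np.random.default_rng(seed)
        pick = rng.choice(len(train_feats),
                          size=min(n_ref, len(train_feats)), replace=False)
        X = np.vstack([train_feats[pick], test_feat[None, :]])
        y = np.concatenate([np.zeros(len(pick)), np.ones(1)])
        clf = LogisticRegression(max_iter=200,
                                 class_weight="balanced").fit(X, y)
        # A larger margin on the lone test sample means it is easier to
        # isolate from the training data, i.e. more likely OOD.
        return float(clf.decision_function(test_feat[None, :])[0])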
♻ ☆ A Practical Investigation of Spatially-Controlled Image Generation with Transformers
Enabling image generation models to be spatially controlled is an important
area of research, empowering users to better generate images according to their
own fine-grained specifications via e.g. edge maps, poses. Although this task
has seen impressive improvements in recent times, a focus on rapidly producing
stronger models has come at the cost of detailed and fair scientific
comparison. Differing training data, model architectures and generation
paradigms make it difficult to disentangle the factors contributing to
performance. Meanwhile, the motivations and nuances of certain approaches
become lost in the literature. In this work, we aim to provide clear takeaways
across generation paradigms for practitioners wishing to develop
transformer-based systems for spatially-controlled generation, clarifying the
literature and addressing knowledge gaps. We perform controlled experiments on
ImageNet across diffusion-based/flow-based and autoregressive (AR) models.
First, we establish control token prefilling as a simple, general and
performant baseline approach for transformers. We then investigate previously
underexplored sampling time enhancements, showing that extending
classifier-free guidance to control, as well as softmax truncation, have a
strong impact on control-generation consistency. Finally, we re-clarify the
motivation of adapter-based approaches, demonstrating that they mitigate
"forgetting" and maintain generation quality when trained on limited downstream
data, but underperform full training in terms of generation-control
consistency.
comment: TMLR https://openreview.net/forum?id=loT6xhgLYK
♻ ☆ The Coralscapes Dataset: Semantic Scene Understanding in Coral Reefs
Coral reefs are declining worldwide due to climate change and local
stressors. To inform effective conservation or restoration, monitoring at the
highest possible spatial and temporal resolution is necessary. Conventional
coral reef surveying methods are limited in scalability due to their reliance
on expert labor time, motivating the use of computer vision tools to automate
the identification and abundance estimation of live corals from images.
However, the design and evaluation of such tools has been impeded by the lack
of large high quality datasets. We release the Coralscapes dataset, the first
general-purpose dense semantic segmentation dataset for coral reefs, covering
2075 images, 39 benthic classes, and 174k segmentation masks annotated by
experts. Coralscapes has a similar scope and the same structure as the widely
used Cityscapes dataset for urban scene segmentation, allowing benchmarking of
semantic segmentation models in a new challenging domain which requires expert
knowledge to annotate. We benchmark a wide range of semantic segmentation
models, and find that transfer learning from Coralscapes to existing smaller
datasets consistently leads to state-of-the-art performance. Coralscapes will
catalyze research on efficient, scalable, and standardized coral reef surveying
methods based on computer vision, and holds the potential to streamline the
development of underwater ecological robotics.
♻ ☆ Image Super-Resolution with Guarantees via Conformalized Generative Models NeurIPS 2025
The increasing use of generative ML foundation models for image restoration
tasks such as super-resolution calls for robust and interpretable uncertainty
quantification methods. We address this need by presenting a novel approach
based on conformal prediction techniques to create a 'confidence mask' capable
of reliably and intuitively communicating where the generated image can be
trusted. Our method is adaptable to any black-box generative model, including
those locked behind an opaque API, requires only easily attainable data for
calibration, and is highly customizable via the choice of a local image
similarity metric. We prove strong theoretical guarantees for our method that
span fidelity error control (according to our local image similarity metric),
reconstruction quality, and robustness in the face of data leakage. Finally, we
empirically evaluate these results and establish our method's solid
performance.
comment: To appear at NeurIPS 2025. 17 pages, 7 figures
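The calibration behind such a confidence mask follows the usual split-conformal
recipe: compute nonconformity scores (for example, one minus the chosen local
image similarity) on a held-out calibration set, take the finite-sample
corrected quantile, and threshold per-pixel uncertainty at test time. The
sketch below is this generic recipe, not the paper's specific construction:

    import numpy as np

    def conformal_threshold(cal_scores, alpha=0.1):
        # cal_scores: 1D array of calibration nonconformity scores.
        n = len(cal_scores)
        q = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
        return np.quantile(cal_scores, q, method="higher")

    def confidence_mask(pixel_scores, threshold):
        # True where the generated pixel can be trusted at level 1 - alpha.
        return pixel_scores <= threshold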
♻ ☆ Positive Semi-definite Latent Factor Grouping-Boosted Cluster-reasoning Instance Disentangled Learning for WSI Representation
Multiple instance learning (MIL) has been widely used for representing
whole-slide pathology images. However, spatial, semantic, and decision
entanglements among instances limit its representation and interpretability. To
address these challenges, we propose a latent factor grouping-boosted
cluster-reasoning instance disentangled learning framework for whole-slide
image (WSI) interpretable representation in three phases. First, we introduce a
novel positive semi-definite latent factor grouping that maps instances into a
latent subspace, effectively mitigating spatial entanglement in MIL. To
alleviate semantic entanglement, we employ instance probability counterfactual
inference and optimization via cluster-reasoning instance disentangling.
Finally, we employ a generalized linear weighted decision via instance effect
re-weighting to address decision entanglement. Extensive experiments on
multicentre datasets demonstrate that our model outperforms all
state-of-the-art models. Moreover, it attains pathologist-aligned
interpretability through disentangled representations and a transparent
decision-making process.
comment: Our code is available at https://github.com/Prince-Lee-PathAI/PG-CIDL
♻ ☆ Advances in Feed-Forward 3D Reconstruction and View Synthesis: A Survey
Jiahui Zhang, Yuelei Li, Anpei Chen, Muyu Xu, Kunhao Liu, Jianyuan Wang, Xiao-Xiao Long, Hanxue Liang, Zexiang Xu, Hao Su, Christian Theobalt, Christian Rupprecht, Andrea Vedaldi, Kaichen Zhou, Paul Pu Liang, Shijian Lu, Fangneng Zhan
3D reconstruction and view synthesis are foundational problems in computer
vision, graphics, and immersive technologies such as augmented reality (AR),
virtual reality (VR), and digital twins. Traditional methods rely on
computationally intensive iterative optimization in a complex chain, limiting
their applicability in real-world scenarios. Recent advances in feed-forward
approaches, driven by deep learning, have revolutionized this field by enabling
fast and generalizable 3D reconstruction and view synthesis. This survey offers
a comprehensive review of feed-forward techniques for 3D reconstruction and
view synthesis, with a taxonomy according to the underlying representation
architectures including point cloud, 3D Gaussian Splatting (3DGS), Neural
Radiance Fields (NeRF), etc. We examine key tasks such as pose-free
reconstruction, dynamic 3D reconstruction, and 3D-aware image and video
synthesis, highlighting their applications in digital humans, SLAM, robotics,
and beyond. In addition, we review commonly used datasets with detailed
statistics, along with evaluation protocols for various downstream tasks. We
conclude by discussing open research challenges and promising directions for
future work, emphasizing the potential of feed-forward approaches to advance
the state of the art in 3D vision.
comment: A project page associated with this survey is available at
https://fnzhan.com/projects/Feed-Forward-3D
♻ ☆ Prompt to Restore, Restore to Prompt: Cyclic Prompting for Universal Adverse Weather Removal
Universal adverse weather removal (UAWR) seeks to address various weather
degradations within a unified framework. Recent methods are inspired by prompt
learning using pre-trained vision-language models (e.g., CLIP), leveraging
degradation-aware prompts to facilitate weather-free image restoration,
yielding significant improvements. In this work, we propose CyclicPrompt, an
innovative cyclic prompt approach designed to enhance the effectiveness,
adaptability, and generalizability of UAWR. CyclicPrompt comprises two key
components: 1) a composite context prompt that integrates weather-related
information and context-aware representations into the network to guide
restoration. This prompt differs from previous methods by marrying learnable
input-conditional vectors with weather-specific knowledge, thereby improving
adaptability across various degradations. 2) The erase-and-paste mechanism,
after the initial guided restoration, substitutes weather-specific knowledge
with constrained restoration priors, inducing high-quality weather-free
concepts into the composite prompt to further fine-tune the restoration
process. Therefore, we can form a cyclic "Prompt-Restore-Prompt" pipeline that
adeptly harnesses weather-specific knowledge, textual contexts, and reliable
textures. Extensive experiments on synthetic and real-world datasets validate
the superior performance of CyclicPrompt. The code is available at:
https://github.com/RongxinL/CyclicPrompt.
♻ ☆ Mobile Robotic Multi-View Photometric Stereo SP
Multi-View Photometric Stereo (MVPS) is a popular method for fine-detailed 3D
acquisition of an object from images. Despite its outstanding results on
diverse material objects, a typical MVPS experimental setup requires a
well-calibrated light source and a monocular camera installed on an immovable
base. This restricts the use of MVPS on a movable platform, limiting us from
taking MVPS benefits in 3D acquisition for mobile robotics applications. To
this end, we introduce a new mobile robotic system for MVPS. While the proposed
system brings advantages, it introduces additional algorithmic challenges.
Addressing them, in this paper, we further propose an incremental approach for
mobile robotic MVPS. Our approach leverages a supervised learning setup to
predict per-view surface normal, object depth, and per-pixel uncertainty in
model-predicted results. A refined depth map per view is obtained by solving an
MVPS-driven optimization problem proposed in this paper. Later, we fuse the
refined depth map while tracking the camera pose w.r.t the reference frame to
recover globally consistent object 3D geometry. Experimental results show the
advantages of our robotic system and algorithm, featuring the local
high-frequency surface detail recovery with globally consistent object shape.
Our work goes beyond any MVPS system presented to date, providing encouraging
results on objects with unknown reflectance properties using fewer frames,
without a tedious calibration and installation process, and enabling a
computationally efficient robotic automation approach to photogrammetry. The
proposed approach is nearly 100 times computationally faster than
state-of-the-art MVPS methods such as [1, 2] while maintaining similar results
when tested on subjects from the benchmark DiLiGenT-MV dataset [3].
comment: Acknowledgment added. Published in the ISPRS Journal of Photogrammetry
  and Remote Sensing. 32 pages, 14 figures, 5 tables
♻ ☆ GeoLLaVA-8K: Scaling Remote-Sensing Multimodal Large Language Models to 8K Resolution
Fengxiang Wang, Mingshuo Chen, Yueying Li, Di Wang, Haotian Wang, Zonghao Guo, Zefan Wang, Boqi Shan, Long Lan, Yulin Wang, Hongzhen Wang, Wenjing Yang, Bo Du, Jing Zhang
Ultra-high-resolution (UHR) remote sensing (RS) imagery offers valuable data
for Earth observation but poses challenges for existing multimodal foundation
models due to two key bottlenecks: (1) limited availability of UHR training
data, and (2) token explosion caused by the large image size. To address data
scarcity, we introduce SuperRS-VQA (avg. 8,376$\times$8,376) and HighRS-VQA
(avg. 2,000$\times$1,912), the highest-resolution vision-language datasets in
RS to date, covering 22 real-world dialogue tasks. To mitigate token explosion,
our pilot studies reveal significant redundancy in RS images: crucial
information is concentrated in a small subset of object-centric tokens, while
pruning background tokens (e.g., ocean or forest) can even improve performance.
Motivated by these findings, we propose two strategies: Background Token
Pruning and Anchored Token Selection, to reduce the memory footprint while
preserving key semantics. Integrating these techniques, we introduce
GeoLLaVA-8K, the first RS-focused multimodal large language model capable of
handling inputs up to 8K$\times$8K resolution, built on the LLaVA framework.
Trained on SuperRS-VQA and HighRS-VQA, GeoLLaVA-8K sets a new state-of-the-art
on the XLRS-Bench.
comment: NeurIPS 2025 Spotlight
♻ ☆ Label tree semantic losses for rich multi-class medical image segmentation
Rich and accurate medical image segmentation is poised to underpin the next
generation of AI-defined clinical practice by delineating critical anatomy for
pre-operative planning, guiding real-time intra-operative navigation, and
supporting precise post-operative assessment. However, commonly used learning
methods for medical and surgical imaging segmentation tasks penalise all errors
equivalently and thus fail to exploit any inter-class semantics in the label
space. This becomes particularly problematic as the cardinality and richness of
labels increases to include subtly different classes. In this work, we propose
two tree-based semantic loss functions which take advantage of a hierarchical
organisation of the labels. We further incorporate our losses in a recently
proposed approach for training with sparse, background-free annotations to
extend the applicability of our proposed losses. Extensive experiments are
reported on two medical and surgical image segmentation tasks, namely head MRI
for whole brain parcellation (WBP) with full supervision and neurosurgical
hyperspectral imaging (HSI) for scene understanding with sparse annotations.
Results demonstrate that our proposed method reaches state-of-the-art
performance in both cases.
♻ ☆ Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model
Recently, augmenting vision-language-action models (VLAs) with world-models
has shown promise in robotic policy learning. However, it remains challenging
to jointly predict next-state observations and action sequences because of the
inherent difference between the two modalities. To address this, we propose
DUal-STream diffusion (DUST), a world-model augmented VLA framework that
handles the modality conflict and enhances the performance of VLAs across
diverse tasks. Specifically, we propose a multimodal diffusion transformer
architecture that explicitly maintains separate modality streams while enabling
cross-modal knowledge sharing. In addition, we propose training techniques such
as independent noise perturbations for each modality and a decoupled flow
matching loss, which enables the model to learn the joint distribution in a
bidirectional manner while avoiding the need for a unified latent space.
Furthermore, based on the decoupled training framework, we introduce a sampling
method where we sample action and vision tokens asynchronously at different
rates, which shows improvement through inference-time scaling. Through
experiments on simulated benchmarks such as RoboCasa and GR-1, DUST achieves up
to 6% gains over a standard VLA baseline and implicit world-modeling methods,
with our inference-time scaling approach providing an additional 2-5% gain on
success rate. On real-world tasks with the Franka Research 3, DUST outperforms
baselines in success rate by 13%, confirming its effectiveness beyond
simulation. Lastly, we demonstrate the effectiveness of DUST in large-scale
pretraining with action-free videos from BridgeV2, where DUST leads to
significant gain when transferred to the RoboCasa benchmark.
comment: 20 pages, 10 figures
♻ ☆ Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions ICCV2025
As AI systems become increasingly integrated into human lives, endowing them
with robust social intelligence has emerged as a critical frontier. A key
aspect of this intelligence is discerning truth from deception, a ubiquitous
element of human interaction that is conveyed through a complex interplay of
verbal language and non-verbal visual cues. However, automatic deception
detection in dynamic, multi-party conversations remains a significant
challenge. The recent rise of powerful Multimodal Large Language Models
(MLLMs), with their impressive abilities in visual and textual understanding,
makes them natural candidates for this task. However, their capabilities
in this crucial domain remain mostly unquantified. To address this gap, we
introduce a new task, Multimodal Interactive Veracity Assessment (MIVA), and
present a novel multimodal dataset derived from the social deduction game
Werewolf. This dataset provides synchronized video and text with verifiable
ground-truth labels for every statement. We establish a comprehensive benchmark
evaluating state-of-the-art MLLMs, revealing a significant performance gap:
even powerful models like GPT-4o struggle to distinguish truth from falsehood
reliably. Our analysis of failure modes indicates that these models fail to
ground language in visual social cues effectively and may be overly
conservative in their alignment, highlighting the urgent need for novel
approaches to building more perceptive and trustworthy AI systems.
comment: ICCV2025 Workshop
♻ ☆ Rethinking Video Super-Resolution: Towards Diffusion-Based Methods without Motion Alignment SP
In this work, we rethink the approach to video super-resolution by
introducing a method based on the Diffusion Posterior Sampling framework,
combined with an unconditional video diffusion transformer operating in latent
space. The video generation model, a diffusion transformer, functions as a
space-time model. We argue that a powerful model, which learns the physics of
the real world, can easily handle various kinds of motion patterns as prior
knowledge, thus eliminating the need for explicit estimation of optical flows
or motion parameters for pixel alignment. Furthermore, a single instance of the
proposed video diffusion transformer model can adapt to different sampling
conditions without re-training. Empirical results on synthetic and real-world
datasets illustrate the feasibility of diffusion-based, alignment-free video
super-resolution.
comment: ICSPS 2025
♻ ☆ Robust Identity Perceptual Watermark Against Deepfake Face Swapping
Notwithstanding offering convenience and entertainment to society, Deepfake
face swapping has caused critical privacy issues with the rapid development of
deep generative models. Due to imperceptible artifacts in high-quality
synthetic images, passive detection models against face swapping have in
recent years suffered performance degradation due to limited generalizability
in cross-domain scenarios. Therefore, several studies have attempted to
proactively protect the original images against malicious manipulations by
inserting invisible signals in advance. However, existing proactive defense
approaches demonstrate unsatisfactory results with respect to visual quality,
detection accuracy, and source tracing ability. In this study, to fill the
research gap, we propose a robust identity perceptual watermarking framework
that concurrently performs detection and source tracing against Deepfake face
swapping proactively. We innovatively assign identity semantics regarding the
image contents to the watermarks and devise an unpredictable and nonreversible
chaotic encryption system to ensure watermark confidentiality. The watermarks
are robustly encoded and recovered by jointly training an encoder-decoder
framework along with adversarial image manipulations. For a suspect image,
falsification is accomplished by verifying the consistency between the
content-matched identity perceptual watermark and the recovered robust
watermark, without requiring the ground truth. Moreover, source tracing can be
accomplished based on the identity semantics that the recovered watermark
carries. Extensive experiments demonstrate state-of-the-art detection and
source tracing performance against Deepfake face swapping with promising
watermark robustness for both cross-dataset and cross-manipulation settings.
comment: In peer review
♻ ☆ RoMA: Scaling up Mamba-based Foundation Models for Remote Sensing NeurIPS 2025
Fengxiang Wang, Yulin Wang, Mingshuo Chen, Haiyan Zhao, Yangang Sun, Shuo Wang, Hongzhen Wang, Di Wang, Long Lan, Wenjing Yang, Jing Zhang
Recent advances in self-supervised learning for Vision Transformers (ViTs)
have fueled breakthroughs in remote sensing (RS) foundation models. However,
the quadratic complexity of self-attention poses a significant barrier to
scalability, particularly for large models and high-resolution images. While
the linear-complexity Mamba architecture offers a promising alternative,
existing RS applications of Mamba remain limited to supervised tasks on small,
domain-specific datasets. To address these challenges, we propose RoMA, a
framework that enables scalable self-supervised pretraining of Mamba-based RS
foundation models using large-scale, diverse, unlabeled data. RoMA enhances
scalability for high-resolution images through a tailored auto-regressive
learning strategy, incorporating two key innovations: 1) a rotation-aware
pretraining mechanism combining adaptive cropping with angular embeddings to
handle sparsely distributed objects with arbitrary orientations, and 2)
multi-scale token prediction objectives that address the extreme variations in
object scales inherent to RS imagery. Systematic empirical studies validate
that Mamba adheres to RS data and parameter scaling laws, with performance
scaling reliably as model and data size increase. Furthermore, experiments
across scene classification, object detection, and semantic segmentation tasks
demonstrate that RoMA-pretrained Mamba models consistently outperform ViT-based
counterparts in both accuracy and computational efficiency. The source code and
pretrained models will be released at https://github.com/MiliLab/RoMA.
comment: NeurIPS 2025
♻ ☆ ThinkMorph: Emergent Properties in Multimodal Interleaved Chain-of-Thought Reasoning
Jiawei Gu, Yunzhuo Hao, Huichen Will Wang, Linjie Li, Michael Qizhe Shieh, Yejin Choi, Ranjay Krishna, Yu Cheng
Multimodal reasoning requires iterative coordination between language and
vision, yet it remains unclear what constitutes a meaningful interleaved chain
of thought. We posit that text and image thoughts should function as
complementary rather than isomorphic modalities that mutually advance
reasoning. Guided by this principle, we build ThinkMorph, a unified model
fine-tuned on approximately 24K high-quality interleaved reasoning traces
spanning tasks with varying visual engagement. ThinkMorph learns to generate
progressive text-image reasoning steps that concretely manipulate visual
content while maintaining coherent verbal logic. It delivers large gains on
vision-centric benchmarks (averaging 34.7 percent over the base model) and
generalizes to out-of-domain tasks, matching or surpassing larger and
proprietary VLMs. Beyond performance, ThinkMorph exhibits emergent multimodal
intelligence, including unseen visual manipulation skills, adaptive switching
between reasoning modes, and better test-time scaling through diversified
multimodal thoughts. These findings suggest promising directions for
characterizing the emergent capabilities of unified models for multimodal
reasoning.
comment: project page: https://thinkmorph.github.io/
♻ ☆ Training Convolutional Neural Networks with the Forward-Forward algorithm SC
Recent successes in image analysis with deep neural networks are achieved
almost exclusively with Convolutional Neural Networks (CNNs), typically trained
using the backpropagation (BP) algorithm. In a 2022 preprint, Geoffrey Hinton
proposed the Forward-Forward (FF) algorithm as a biologically inspired
alternative, where positive and negative examples are jointly presented to the
network and training is guided by a locally defined goodness function. Here, we
extend the FF paradigm to CNNs. We introduce two spatially extended labeling
strategies, based on Fourier patterns and morphological transformations, that
enable convolutional layers to access label information across all spatial
positions. On CIFAR10, we show that deeper FF-trained CNNs can be optimized
successfully and that morphology-based labels prevent shortcut solutions on
datasets with more complex and fine-grained features. On CIFAR100, carefully designed
label sets scale effectively to 100 classes. Class Activation Maps reveal that
FF-trained CNNs learn meaningful and complementary features across layers.
Together, these results demonstrate that FF training is feasible beyond fully
connected networks, provide new insights into its learning dynamics and
stability, and highlight its potential for neuromorphic computing and
biologically inspired learning.
comment: Peer-reviewed version published in Scientific Reports (2025). DOI:
  10.1038/s41598-025-26235-2
♻ ☆ Uniworld-V2: Reinforce Image Editing with Diffusion Negative-aware Finetuning and MLLM Implicit Feedback
Zongjian Li, Zheyuan Liu, Qihui Zhang, Bin Lin, Feize Wu, Shenghai Yuan, Zhiyuan Yan, Yang Ye, Wangbo Yu, Yuwei Niu, Shaodong Wang, Xinhua Cheng, Li Yuan
Instruction-based image editing has achieved remarkable progress; however,
models solely trained via supervised fine-tuning often overfit to annotated
patterns, hindering their ability to explore and generalize beyond training
distributions. To address this, we introduce Edit-R1, a novel post-training
framework for instruction-based image editing based on policy optimization.
Specifically, we utilize Diffusion Negative-aware Finetuning (DiffusionNFT), a
likelihood-free policy optimization method consistent with the flow matching
forward process, thereby enabling the use of higher-order samplers and more
efficient training. Another key challenge here is the absence of a universal
reward model, resulting from the diverse nature of editing instructions and
tasks. To bridge this gap, we employ a Multimodal Large Language Model (MLLM)
as a unified, training-free reward model, leveraging its output logits to
provide fine-grained feedback. Furthermore, we carefully design a low-variance
group filtering mechanism to reduce MLLM scoring noise and stabilize
optimization. \texttt{UniWorld-V2}, trained with this framework, achieves
\textbf{state-of-the-art} results on the ImgEdit and GEdit-Bench benchmarks,
scoring 4.49 and 7.83, respectively. Crucially, our framework is
model-agnostic, delivering substantial performance gains when applied to
diverse base models like Qwen-Image-Edit and FLUX-Kontext, demonstrating its
wide applicability. Code and models are publicly available to support further
research.
♻ ☆ OmniEarth-Bench: Towards Holistic Evaluation of Earth's Six Spheres and Cross-Spheres Interactions with Multimodal Observational Earth Data
Fengxiang Wang, Mingshuo Chen, Xuming He, Yueying Li, YiFan Zhang, Feng Liu, Zijie Guo, Zhenghao Hu, Jiong Wang, Jingyi Xu, Zhangrui Li, Fenghua Ling, Ben Fei, Weijia Li, Long Lan, Wenjing Yang, Wenlong Zhang, Lei Bai
Existing benchmarks for multimodal learning in Earth science offer limited,
siloed coverage of Earth's spheres and their cross-sphere interactions,
typically restricting evaluation to the human-activity sphere of atmosphere and
to at most 16 tasks. These limitations include \textit{narrow-source heterogeneity
(single/few data sources), constrained scientific granularity, and
limited-sphere extensibility}. Therefore, we introduce
\textbf{OmniEarth-Bench}, the first multimodal benchmark that systematically
spans all six spheres: atmosphere, lithosphere, oceanosphere, cryosphere,
biosphere, and human-activity sphere, and cross-spheres. Built with a scalable,
modular-topology data inference framework, native multi-observation sources,
and expert-in-the-loop curation, OmniEarth-Bench produces 29,855 standardized,
expert-curated annotations. All annotations are organized into a four-level
hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated
evaluation tasks. Experiments on 9 state-of-the-art MLLMs reveal that even the
most advanced models struggle with our benchmark: none of them reaches
35\% accuracy, exposing systematic gaps in Earth-system cognitive ability. The
dataset and evaluation code were released at OmniEarth-Bench
(https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
♻ ☆ Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation
Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, Guanghui Ren
We introduce Genie Envisioner (GE), a unified world foundation platform for
robotic manipulation that integrates policy learning, evaluation, and
simulation within a single video-generative framework. At its core, GE-Base is
a large-scale, instruction-conditioned video diffusion model that captures the
spatial, temporal, and semantic dynamics of real-world robotic interactions in
a structured latent space. Built upon this foundation, GE-Act maps latent
representations to executable action trajectories through a lightweight,
flow-matching decoder, enabling precise and generalizable policy inference
across diverse embodiments with minimal supervision. To support scalable
evaluation and training, GE-Sim serves as an action-conditioned neural
simulator, producing high-fidelity rollouts for closed-loop policy development.
The platform is further equipped with EWMBench, a standardized benchmark suite
measuring visual fidelity, physical consistency, and instruction-action
alignment. Together, these components establish Genie Envisioner as a scalable
and practical foundation for instruction-driven, general-purpose embodied
intelligence. All code, models, and benchmarks will be released publicly.
comment: https://genie-envisioner.github.io/
♻ ☆ A Quantitative Evaluation Framework for Explainable AI in Semantic Segmentation
Ensuring transparency and trust in artificial intelligence (AI) models is
essential as they are increasingly deployed in safety-critical and high-stakes
domains. Explainable AI (XAI) has emerged as a promising approach to address
this challenge; however, the rigorous evaluation of XAI methods remains vital
for balancing the trade-offs between model complexity, predictive performance,
and interpretability. While substantial progress has been made in evaluating
XAI for classification tasks, strategies tailored to semantic segmentation
remain limited. Moreover, objectively assessing XAI approaches is difficult,
since qualitative visual explanations provide only preliminary insights. Such
qualitative methods are inherently subjective and cannot ensure the accuracy or
stability of explanations. To address these limitations, this work introduces a
comprehensive quantitative evaluation framework for assessing XAI in semantic
segmentation, accounting for both spatial and contextual task complexities. The
framework systematically integrates pixel-level evaluation strategies with
carefully designed metrics to yield fine-grained interpretability insights.
Simulation results using recently adapted class activation mapping (CAM)-based
XAI schemes demonstrate the efficiency, robustness, and reliability of the
proposed methodology. These findings advance the development of transparent,
trustworthy, and accountable semantic segmentation models.
♻ ☆ Breaking Down Monocular Ambiguity: Exploiting Temporal Evolution for 3D Lane Detection
Monocular 3D lane detection aims to estimate the 3D position of lanes from
frontal-view (FV) images. However, existing methods are fundamentally
constrained by the inherent ambiguity of single-frame input, which leads to
inaccurate geometric predictions and poor lane integrity, especially for
distant lanes. To overcome this, we propose to unlock the rich information
embedded in the temporal evolution of the scene as the vehicle moves. Our
proposed Geometry-aware Temporal Aggregation Network (GTA-Net) systematically
leverages the temporal information from complementary perspectives. First,
Temporal Geometry Enhancement Module (TGEM) learns geometric consistency across
consecutive frames, effectively recovering depth information from motion to
build a reliable 3D scene representation. Second, to enhance lane integrity,
Temporal Instance-aware Query Generation (TIQG) module aggregates instance cues
from past and present frames. Crucially, for lanes that are ambiguous in the
current view, TIQG innovatively synthesizes a pseudo future perspective to
generate queries that reveal lanes which would otherwise be missed. The
experiments demonstrate that GTA-Net achieves new SoTA results, significantly
outperforming existing monocular 3D lane detection solutions.
♻ ☆ Cross-modal Diffusion Modelling for Super-resolved Spatial Transcriptomics
The recent advancement of spatial transcriptomics (ST) makes it possible to characterize
spatial gene expression within tissue for discovery research. However, current
ST platforms suffer from low resolution, hindering in-depth understanding of
spatial gene expression. Super-resolution approaches promise to enhance ST maps
by integrating histology images with gene expressions of profiled tissue spots.
However, current super-resolution methods are limited by restoration
uncertainty and mode collapse. Although diffusion models have shown promise in
capturing complex interactions between multi-modal conditions, it remains a
challenge to integrate histology images and gene expression for super-resolved
ST maps. This paper proposes a cross-modal conditional diffusion model for
super-resolving ST maps with the guidance of histology images. Specifically, we
design a multi-modal disentangling network with cross-modal adaptive modulation
to utilize complementary information from histology images and spatial gene
expression. Moreover, we propose a dynamic cross-attention modelling strategy
to extract hierarchical cell-to-tissue information from histology images.
Lastly, we propose a co-expression-based gene-correlation graph network to
model the co-expression relationship of multiple genes. Experiments show that
our method outperforms other state-of-the-art methods in ST super-resolution on
three public datasets.
♻ ☆ Progressive Growing of Patch Size: Curriculum Learning for Accelerated and Improved Medical Image Segmentation MICCAI2024
Stefan M. Fischer, Johannes Kiechle, Laura Daza, Lina Felsner, Richard Osuala, Daniel M. Lang, Karim Lekadir, Jan C. Peeken, Julia A. Schnabel
In this work, we introduce Progressive Growing of Patch Size, an automatic
curriculum learning approach for 3D medical image segmentation. Our approach
progressively increases the patch size during model training, resulting in an
improved class balance for smaller patch sizes and accelerated convergence of
the training process. We evaluate our curriculum approach in two settings: a
resource-efficient mode and a performance mode, both regarding Dice score
performance and computational costs across 15 diverse and popular 3D medical
image segmentation tasks. The resource-efficient mode matches the Dice score
performance of the conventional constant patch size sampling baseline with a
notable reduction in training time to only 44%. The performance mode improves
upon constant patch size segmentation results, achieving a statistically
significant relative mean performance gain of 1.28% in Dice Score. Remarkably,
across all 15 tasks, our proposed performance mode manages to surpass the
constant patch size baseline in Dice Score performance, while simultaneously
reducing training time to only 89%. The benefits are particularly pronounced
for highly imbalanced tasks such as lesion segmentation tasks. Rigorous
experiments demonstrate that our performance mode not only improves mean
segmentation performance but also reduces performance variance, yielding more
trustworthy model comparison. Furthermore, our findings reveal that the
proposed curriculum sampling is not tied to a specific architecture but
represents a broadly applicable strategy that consistently boosts performance
across diverse segmentation models, including UNet, UNETR, and SwinUNETR. In
summary, we show that this simple yet elegant transformation on input data
substantially improves both Dice Score performance and training runtime, while
being compatible across diverse segmentation backbones.
comment: Journal Extension of "Progressive Growing of Patch Size:
Resource-Efficient Curriculum Learning for Dense Prediction Tasks"
(MICCAI2024) submitted to MedIA
♻ ☆ CGF-DETR: Cross-Gated Fusion DETR for Enhanced Pneumonia Detection in Chest X-rays
Pneumonia remains a leading cause of morbidity and mortality worldwide,
necessitating accurate and efficient automated detection systems. While recent
transformer-based detectors like RT-DETR have shown promise in object detection
tasks, their application to medical imaging, particularly pneumonia detection
in chest X-rays, remains underexplored. This paper presents CGF-DETR, an
enhanced real-time detection transformer specifically designed for pneumonia
detection. We introduce XFABlock in the backbone to improve multi-scale feature
extraction through convolutional attention mechanisms integrated with CSP
architecture. To achieve efficient feature aggregation, we propose the SPGA
module, which replaces standard multi-head attention with dynamic gating mechanisms and
single-head self-attention. Additionally, GCFC3 is designed for the neck to
enhance feature representation through multi-path convolution fusion while
maintaining real-time performance via structural re-parameterization. Extensive
experiments on the RSNA Pneumonia Detection dataset demonstrate that CGF-DETR
achieves 82.2% mAP@0.5, outperforming the baseline RT-DETR-l by 3.7% while
maintaining comparable inference speed at 48.1 FPS. Our ablation studies
confirm that each proposed module contributes meaningfully to the overall
performance improvement, with the complete model achieving 50.4% mAP@[0.5:0.95].
♻ ☆ Improving Generalization in MRI-Based Deep Learning Models for Total Knee Replacement Prediction
Knee osteoarthritis (KOA) is a common joint disease that causes pain and
mobility issues. While MRI-based deep learning models have demonstrated
superior performance in predicting total knee replacement (TKR) and disease
progression, their generalizability remains challenging, particularly when
applied to imaging data from different sources. In this study, we show that
replacing batch normalization with instance normalization, using data
augmentation, and applying contrastive loss improves generalization. For
training and evaluation, we used MRI data from the Osteoarthritis Initiative
(OAI) database, considering sagittal fat-suppressed intermediate-weighted turbo
spin-echo (FS-IW-TSE) images as the source domain and sagittal fat-suppressed
three-dimensional (3D) dual-echo in steady state (DESS) images as the target
domain. The results demonstrated a statistically significant improvement in
classification metrics across both domains by replacing batch normalization
with instance normalization in the baseline model, generating augmented input
views using the Global Intensity Non-linear (GIN) augmentation method, and
incorporating a supervised contrastive loss alongside the classification loss
to align representations of samples with the same label. The GIN method with
contrastive loss performed better than all evaluated single-source domain
generalization methods when using 3D instance normalization. Comparing GIN with
and without contrastive loss (for both normalization types) showed that adding
contrastive loss consistently led to better performance.
♻ ☆ Research on Expressway Congestion Warning Technology Based on YOLOv11-DIoU and GRU-Attention
Expressway traffic congestion severely reduces travel efficiency and hinders
regional connectivity. Existing "detection-prediction" systems have critical
flaws: low vehicle perception accuracy under occlusion and loss of
long-sequence dependencies in congestion forecasting. This study proposes an
integrated technical framework to resolve these issues. For traffic flow
perception, two baseline algorithms were optimized. Traditional YOLOv11 was
upgraded to YOLOv11-DIoU by replacing GIoU Loss with DIoU Loss, and DeepSort
was improved by fusing Mahalanobis (motion) and cosine (appearance) distances.
Experiments on Chang-Shen Expressway videos showed YOLOv11-DIoU achieved 95.7\%
mAP (6.5 percentage points higher than baseline) with 5.3\% occlusion miss
rate. DeepSort reached 93.8\% MOTA (11.3 percentage points higher than SORT)
with only 4 ID switches. Using the Greenberg model (for 10-15 vehicles/km
high-density scenarios), speed and density showed a strong negative correlation
(r=-0.97), conforming to traffic flow theory. For congestion warning, a
GRU-Attention model was built to capture congestion precursors. Trained 300
epochs with flow, density, and speed, it achieved 99.7\% test accuracy (7-9
percentage points higher than traditional GRU). In 10-minute advance warnings
for 30-minute congestion, time error was $\leq$ 1 minute. Validation with an
independent video showed 95\% warning accuracy, over 90\% spatial overlap of
congestion points, and stable performance in high-flow ($>$5 vehicles/second)
scenarios. This framework provides quantitative support for expressway
congestion control, with promising intelligent transportation applications.
♻ ☆ Towards classification-based representation learning for place recognition on LiDAR scans
Place recognition is a crucial task in autonomous driving, allowing vehicles
to determine their position using sensor data. While most existing methods rely
on contrastive learning, we explore an alternative approach by framing place
recognition as a multi-class classification problem. Our method assigns
discrete location labels to LiDAR scans and trains an encoder-decoder model to
classify each scan's position directly. We evaluate this approach on the
NuScenes dataset and show that it achieves competitive performance compared to
contrastive learning-based methods while offering advantages in training
efficiency and stability.
♻ ☆ MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
Current video-to-audio (V2A) methods struggle in complex multi-event
scenarios (video scenarios involving multiple sound sources, sound events, or
transitions) due to two critical limitations. First, existing methods face
challenges in precisely aligning intricate semantic information together with
rapid dynamic features. Second, foundational training lacks quantitative
preference optimization for semantic-temporal alignment and audio quality. As a
result, it fails to enhance integrated generation quality in cluttered
multi-event scenes. To address these core limitations, this study proposes a
novel V2A framework: MultiSoundGen. It introduces direct preference
optimization (DPO) into the V2A domain, leveraging audio-visual pretraining
(AVP) to enhance performance in complex multi-event scenarios. Our
contributions include two key innovations: the first is SlowFast Contrastive
AVP (SF-CAVP), a pioneering AVP model with a unified dual-stream architecture.
SF-CAVP explicitly aligns core semantic representations and rapid dynamic
features of audio-visual data to handle multi-event complexity; second, we
integrate the DPO method into the V2A task and propose AVP-Ranked Preference
Optimization (AVP-RPO). It uses SF-CAVP as a reward model to quantify and
prioritize critical semantic-temporal matches while enhancing audio quality.
Experiments demonstrate that MultiSoundGen achieves state-of-the-art (SOTA)
performance in multi-event scenarios, delivering comprehensive gains across
distribution matching, audio quality, semantic alignment, and temporal
synchronization. Demos are available at
https://v2aresearch.github.io/MultiSoundGen/.
♻ ☆ CWSSNet: Hyperspectral Image Classification Enhanced by Wavelet Domain Convolution
Hyperspectral remote sensing technology has significant application value in
fields such as forestry ecology and precision agriculture, while also putting
forward higher requirements for fine ground object classification. However,
although hyperspectral images are rich in spectral information and can improve
recognition accuracy, they tend to cause prominent feature redundancy due to
their numerous bands, high dimensionality, and spectral mixing characteristics.
To address this, this study used hyperspectral images from the ZY1F satellite
as a data source and selected Yugan County, Shangrao City, Jiangxi Province as
the research area to perform ground object classification research. A
classification framework named CWSSNet was proposed, which integrates 3D
spectral-spatial features and wavelet convolution. This framework integrates
multimodal information using a multiscale convolutional attention module and
breaks through the classification performance bottleneck of traditional methods
by introducing multi-band decomposition and convolution operations in the
wavelet domain. The experiments showed that CWSSNet achieved 74.50\%, 82.73\%,
and 84.94\% in mean Intersection over Union (mIoU), mean Accuracy (mAcc), and
mean F1-score (mF1) respectively in Yugan County. It also obtained the highest
Intersection over Union (IoU) in the classification of water bodies,
vegetation, and bare land, demonstrating good robustness. Additionally, when
the training set proportion was 70\%, the increase in training time was
limited, and the classification effect was close to the optimal level,
indicating that the model maintains reliable performance under small-sample
training conditions.
♻ ☆ Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Data scarcity is a long-standing challenge in the Vision-Language Navigation
(VLN) field, which extremely hinders the generalization of agents to unseen
environments. Previous works primarily rely on additional simulator data or
web-collected images/videos to improve the generalization. However, the
simulator environments still face limited diversity, and the web-collected data
often requires extensive labor to remove the noise. In this paper, we propose a
Rewriting-driven AugMentation (RAM) paradigm for VLN, which directly creates
the unseen observation-instruction pairs via rewriting human-annotated training
data. Benefiting from our rewriting mechanism, new observation-instruction
pairs can be obtained in both simulator-free and labor-saving manners to
promote generalization. Specifically, we first introduce Object-Enriched
Observation Rewriting, where we combine Vision-Language Models (VLMs) and Large
Language Models (LLMs) to derive rewritten object-enriched scene descriptions,
enabling observation synthesis with diverse objects and spatial layouts via
Text-to-Image Generation Models (T2IMs). Then, we propose Observation-Contrast
Instruction Rewriting, which generates observation-aligned rewritten
instructions by requiring LLMs to reason the difference between original and
new observations. We further develop a mixing-then-focusing training strategy
with a random observation cropping scheme, effectively enhancing data
distribution diversity while suppressing augmentation data noise during
training. Experiments on both the discrete environments (R2R, REVERIE, and R4R
datasets) and continuous environments (R2R-CE dataset) show the superior
performance and impressive generalization ability of our method. Code is
available at https://github.com/SaDil13/VLN-RAM.
comment: Accepted by IEEE Transactions on Neural Networks and Learning Systems
♻ ☆ WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Sami Azam, Asif Karim, Jemima Beissbarth, Amanda Leach
Knowledge distillation (KD) has traditionally relied on a static
teacher-student framework, where a large, well-trained teacher transfers
knowledge to a single student model. However, these approaches often suffer
from knowledge degradation, inefficient supervision, and reliance on either a
very strong teacher model or large labeled datasets. To address these, we
present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that
redefines knowledge transfer through a structured sequence of interconnected
models. Unlike conventional KD, it forms a progressive distillation chain,
where each model not only learns from its predecessor but also refines the
knowledge before passing it forward. This structured knowledge transfer further
enhances feature learning and addresses the limitations of one-step KD. Each
model in the chain is trained on only a fraction of the dataset, showing that
effective learning can be achieved with minimal supervision. Extensive
evaluation on six imaging datasets across otoscopic, microscopic, and magnetic
resonance imaging modalities shows that it generalizes and outperforms existing
methods. Furthermore, the proposed distillation chain resulted in cumulative
accuracy gains of up to +23% over a single backbone trained on the same limited
data, which highlights its potential for real-world adoption.
♻ ☆ Kinematify: Open-Vocabulary Synthesis of High-DoF Articulated Objects
A deep understanding of kinematic structures and movable components is
essential for enabling robots to manipulate objects and model their own
articulated forms. Such understanding is captured through articulated objects,
which are essential for tasks such as physical simulation, motion planning, and
policy learning. However, creating these models, particularly for objects with
high degrees of freedom (DoF), remains a significant challenge. Existing
methods typically rely on motion sequences or strong assumptions from
hand-curated datasets, which hinders scalability. In this paper, we introduce
Kinematify, an automated framework that synthesizes articulated objects
directly from arbitrary RGB images or textual descriptions. Our method
addresses two core challenges: (i) inferring kinematic topologies for high-DoF
objects and (ii) estimating joint parameters from static geometry. To achieve
this, we combine MCTS search for structural inference with geometry-driven
optimization for joint reasoning, producing physically consistent and
functionally valid descriptions. We evaluate Kinematify on diverse inputs from
both synthetic and real-world environments, demonstrating improvements in
registration and kinematic topology accuracy over prior work.
comment: project page: https://sites.google.com/deemos.com/kinematify
♻ ☆ SatFusion: A Unified Framework for Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion
Yufei Tong, Guanjie Cheng, Peihan Wu, Yicheng Zhu, Kexu Lu, Feiyi Chen, Meng Xi, Junqin Huang, Xueqiang Yan, Junfan Wang, Shuiguang Deng
With the rapid advancement of the digital society, the proliferation of
satellites in the Satellite Internet of Things (Sat-IoT) has led to the
continuous accumulation of large-scale multi-temporal and multi-source images
across diverse application scenarios. However, existing methods fail to fully
exploit the complementary information embedded in both temporal and source
dimensions. For example, Multi-Image Super-Resolution (MISR) enhances
reconstruction quality by leveraging temporal complementarity across multiple
observations, yet the limited fine-grained texture details in input images
constrain its performance. Conversely, pansharpening integrates multi-source
images by injecting high-frequency spatial information from panchromatic data,
but typically relies on pre-interpolated low-resolution inputs and assumes
noise-free alignment, making it highly sensitive to noise and misregistration.
To address these issues, we propose SatFusion: A Unified Framework for
Enhancing Satellite IoT Images via Multi-Temporal and Multi-Source Data Fusion.
Specifically, SatFusion first employs a Multi-Temporal Image Fusion (MTIF)
module to achieve deep feature alignment with the panchromatic image. Then, a
Multi-Source Image Fusion (MSIF) module injects fine-grained texture
information from the panchromatic data. Finally, a Fusion Composition module
adaptively integrates the complementary advantages of both modalities while
dynamically refining spectral consistency, supervised by a weighted combination
of multiple loss functions. Extensive experiments on the WorldStrat, WV3, QB,
and GF2 datasets demonstrate that SatFusion significantly improves fusion
quality, robustness under challenging conditions, and generalizability to
real-world Sat-IoT scenarios. The code is available at:
https://github.com/dllgyufei/SatFusion.git.
♻ ☆ Deep Fourier-embedded Network for RGB and Thermal Salient Object Detection
The rapid development of deep learning has significantly improved salient
object detection (SOD) combining both RGB and thermal (RGB-T) images. However,
existing Transformer-based RGB-T SOD models with quadratic complexity are
memory-intensive, limiting their application in high-resolution bimodal feature
fusion. To overcome this limitation, we propose a purely Fourier
Transform-based model, namely Deep Fourier-embedded Network (FreqSal), for
accurate RGB-T SOD. Specifically, we leverage the efficiency of Fast Fourier
Transform with linear complexity to design three key components: (1) To fuse
RGB and thermal modalities, we propose Modal-coordinated Perception Attention,
which aligns and enhances bimodal Fourier representation in multiple
dimensions; (2) To clarify object edges and suppress noise, we design
Frequency-decomposed Edge-aware Block, which deeply decomposes and filters
Fourier components of low-level features; (3) To accurately decode features, we
propose Fourier Residual Channel Attention Block, which prioritizes
high-frequency information while aligning channel-wise global relationships.
Additionally, even when converged, existing deep learning-based SOD models'
predictions still exhibit frequency gaps relative to ground-truth. To address
this problem, we propose Co-focus Frequency Loss, which dynamically weights
hard frequencies during edge frequency reconstruction by cross-referencing
bimodal edge information in the Fourier domain. Extensive experiments on ten
bimodal SOD benchmark datasets demonstrate that FreqSal outperforms twenty-nine
existing state-of-the-art bimodal SOD models. Comprehensive ablation studies
further validate the value and effectiveness of our newly proposed components.
The code is available at https://github.com/JoshuaLPF/FreqSal.
comment: Accepted by TCSVT2025
♻ ☆ Crucial-Diff: A Unified Diffusion Model for Crucial Image and Annotation Synthesis in Data-scarce Scenarios
The scarcity of data in various scenarios, such as medical, industry and
autonomous driving, leads to model overfitting and dataset imbalance, thus
hindering effective detection and segmentation performance. Existing studies
employ the generative models to synthesize more training samples to mitigate
data scarcity. However, these synthetic samples are repetitive or simplistic
and fail to provide "crucial information" that targets the downstream model's
weaknesses. Additionally, these methods typically require separate training for
different objects, leading to computational inefficiencies. To address these
issues, we propose Crucial-Diff, a domain-agnostic framework designed to
synthesize crucial samples. Our method integrates two key modules. The Scene
Agnostic Feature Extractor (SAFE) utilizes a unified feature extractor to
capture target information. The Weakness Aware Sample Miner (WASM) generates
hard-to-detect samples using feedback from the detection results of the
downstream model, which is then fused with the output of the SAFE module. Together, our
Crucial-Diff framework generates diverse, high-quality training data, achieving
a pixel-level AP of 83.63% and an F1-MAX of 78.12% on MVTec. On polyp dataset,
Crucial-Diff reaches an mIoU of 81.64% and an mDice of 87.69%. Code is publicly
available at https://github.com/JJessicaYao/Crucial-diff.
comment: Accepted by IEEE Transactions on Image Processing (TIP), 2025
♻ ☆ InternSVG: Towards Unified SVG Tasks with Multimodal Large Language Models
Haomin Wang, Jinhui Yin, Qi Wei, Wenguang Zeng, Lixin Gu, Shenglong Ye, Zhangwei Gao, Yaohui Wang, Yanting Zhang, Yuanqi Li, Yanwen Guo, Wenhai Wang, Kai Chen, Yu Qiao, Hongjie Zhang
General SVG modeling remains challenging due to fragmented datasets, limited
transferability of methods across tasks, and the difficulty of handling
structural complexity. In response, we leverage the strong transfer and
generalization capabilities of multimodal large language models (MLLMs) to
achieve unified modeling for SVG understanding, editing, and generation. We
present the InternSVG family, an integrated data-benchmark-model suite. At its
core is SAgoge, the largest and most comprehensive multimodal dataset for SVG
tasks, encompassing both static graphics and dynamic animations. It covers
icons, long-sequence illustrations, scientific diagrams, and dynamic
animations, supporting tasks of varied difficulty levels and providing deeper
hierarchies with richer attributes compared to previous datasets. Based on this
resource, we introduce SArena, a companion benchmark with comprehensive task
definitions and standardized evaluation that aligns with the domains and
difficulty spectrum covered by SAgoge. Building on these foundations, we
propose InternSVG, a unified MLLM for SVG understanding, editing, and
generation with SVG-specific special tokens, subword-based embedding
initialization, and a two-stage training strategy that progresses from short
static SVGs to long-sequence illustrations and complex animations. This unified
formulation induces positive transfer and improves overall performance.
Experiments on SArena and prior benchmarks confirm that InternSVG achieves
substantial gains and consistently outperforms leading open and proprietary
counterparts.
♻ ☆ FitPro: A Zero-Shot Framework for Interactive Text-based Pedestrian Retrieval in Open World
Text-based Pedestrian Retrieval (TPR) deals with retrieving specific target
pedestrians in visual scenes according to natural language descriptions.
Although existing methods have achieved progress under constrained settings,
interactive retrieval in the open-world scenario still suffers from limited
model generalization and insufficient semantic understanding. To address these
challenges, we propose FitPro, an open-world interactive zero-shot TPR
framework with enhanced semantic comprehension and cross-scene adaptability.
FitPro has three innovative components: Feature Contrastive Decoding (FCD),
Incremental Semantic Mining (ISM), and Query-aware Hierarchical Retrieval
(QHR). The FCD integrates prompt-guided contrastive decoding to generate
high-quality structured pedestrian descriptions from denoised images,
effectively alleviating semantic drift in zero-shot scenarios. The ISM
constructs holistic pedestrian representations from multi-view observations to
achieve global semantic modeling in multi-turn interactions, thereby improving
robustness against viewpoint shifts and fine-grained variations in
descriptions. The QHR dynamically optimizes the retrieval pipeline according to
query types, enabling efficient adaptation to multi-modal and multi-view
inputs. Extensive experiments on five public datasets and two evaluation
protocols demonstrate that FitPro significantly overcomes the generalization
limitations and semantic modeling constraints of existing methods in
interactive retrieval, paving the way for practical deployment.
comment: 12 pages, 6 figures
♻ ☆ 3DViT-GAT: A Unified Atlas-Based 3D Vision Transformer and Graph Learning Framework for Major Depressive Disorder Detection Using Structural MRI Data
Major depressive disorder (MDD) is a prevalent mental health condition that
negatively impacts both individual well-being and global public health.
Automated detection of MDD using structural magnetic resonance imaging (sMRI)
and deep learning (DL) methods holds increasing promise for improving
diagnostic accuracy and enabling early intervention. Most existing methods
employ either voxel-level features or handcrafted regional representations
built from predefined brain atlases, limiting their ability to capture complex
brain patterns. This paper develops a unified pipeline that utilizes Vision
Transformers (ViTs) for extracting 3D region embeddings from sMRI data and
a Graph Neural Network (GNN) for classification. We explore two strategies for
defining regions: (1) an atlas-based approach using predefined structural and
functional brain atlases, and (2) a cube-based method in which ViTs are
trained directly to identify regions from uniformly extracted 3D patches.
Further, cosine similarity graphs are generated to model interregional
relationships, and guide GNN-based classification. Extensive experiments were
conducted using the REST-meta-MDD dataset to demonstrate the effectiveness of
our model. With stratified 10-fold cross-validation, the best model obtained
81.51\% accuracy, 85.94\% sensitivity, 76.36\% specificity, 80.88\% precision,
and 83.33\% F1-score. Further, atlas-based models consistently outperformed the
cube-based approach, highlighting the importance of using domain-specific
anatomical priors for MDD detection.
comment: 17 pages, 3 figures, 9 tables
♻ ☆ Parameterized Prompt for Incremental Object Detection
Recent studies have demonstrated that incorporating trainable prompts into
pretrained models enables effective incremental learning. However, the
application of prompts in incremental object detection (IOD) remains
underexplored. Existing prompt-pool-based approaches assume disjoint class
sets across incremental tasks, which are unsuitable for IOD as they overlook
the inherent co-occurrence phenomenon in detection images. In co-occurring
scenarios, unlabeled objects from previous tasks may appear in current task
images, leading to confusion in prompts pool. In this paper, we hold that
prompt structures should exhibit adaptive consolidation properties across
tasks, with constrained updates to prevent catastrophic forgetting. Motivated
by this, we introduce Parameterized Prompts for Incremental Object Detection
(P$^2$IOD). Leveraging neural networks global evolution properties, P$^2$IOD
employs networks as the parameterized prompts to adaptively consolidate
knowledge across tasks. To constrain prompts structure updates, P$^2$IOD
further engages a parameterized prompts fusion strategy. Extensive experiments
on PASCAL VOC2007 and MS COCO datasets demonstrate that P$^2$IOD's
effectiveness in IOD and achieves the state-of-the-art performance among
existing baselines.
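As an illustration of the "parameterized prompt" idea, the sketch below treats the prompt as a small network and fuses its parameters with the previous task's frozen copy after each incremental step; the convex-averaging fusion rule and the weight alpha are assumptions, not the paper's exact strategy.
```python
# Minimal sketch: a prompt parameterized as a small network, with its weights
# fused against the previous task's copy to constrain updates.
import copy
import torch
import torch.nn as nn

prompt_net = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
prev_prompt = copy.deepcopy(prompt_net)     # frozen copy from the previous task

def fuse_prompts(current: nn.Module, previous: nn.Module, alpha: float = 0.7):
    """Blend current-task prompt parameters with the previous-task ones."""
    with torch.no_grad():
        for p_cur, p_prev in zip(current.parameters(), previous.parameters()):
            p_cur.copy_(alpha * p_cur + (1.0 - alpha) * p_prev)

# ... train prompt_net on the new task's detection data, then:
fuse_prompts(prompt_net, prev_prompt)
```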
♻ ☆ Light Future: Multimodal Action Frame Prediction via InstructPix2Pix WACV 2026
Predicting future motion trajectories is a critical capability across domains
such as robotics, autonomous systems, and human activity forecasting, enabling
safer and more intelligent decision-making. This paper proposes a novel,
efficient, and lightweight approach for robot action prediction, offering
significantly reduced computational cost and inference latency compared to
conventional video prediction models. Importantly, it pioneers the adaptation
of the InstructPix2Pix model for forecasting future visual frames in robotic
tasks, extending its utility beyond static image editing. We implement a deep
learning-based visual prediction framework that forecasts what a robot will
observe 100 frames (10 seconds) into the future, given a current image and a
textual instruction. We repurpose and fine-tune the InstructPix2Pix model to
accept both visual and textual inputs, enabling multimodal future frame
prediction. Experiments on the RoboTWin dataset (generated based on real-world
scenarios) demonstrate that our method achieves superior SSIM and PSNR compared
to state-of-the-art baselines in robot action prediction tasks. Unlike
conventional video prediction models that require multiple input frames, heavy
computation, and slow inference latency, our approach only needs a single image
and a text prompt as input. This lightweight design enables faster inference,
reduced GPU demands, and flexible multimodal control, particularly valuable for
applications like robotics and sports motion trajectory analytics, where motion
trajectory precision is prioritized over visual fidelity.
comment: 9 pages including appendix, 4 tables, 8 figures, to be submitted to
WACV 2026
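For readers who want to try the pipeline shape described above, the hedged sketch below runs the public InstructPix2Pix pipeline from the diffusers library on a single image plus an instruction; the robot-prediction fine-tuned weights, file names, and sampling settings are placeholders, since the paper's checkpoint is not specified here.
```python
# Sketch: single-image + instruction future-frame prediction via InstructPix2Pix.
# The base pipeline API is from diffusers; fine-tuned weights are a placeholder.
import torch
from diffusers import StableDiffusionInstructPix2PixPipeline
from PIL import Image

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "timbrooks/instruct-pix2pix", torch_dtype=torch.float16
).to("cuda")
# pipe.unet.load_state_dict(...)  # hypothetical: load robot-prediction fine-tuned UNet

current_frame = Image.open("current_observation.png").convert("RGB")  # placeholder file
instruction = "Pick up the red block and place it in the box."

future_frame = pipe(
    prompt=instruction,
    image=current_frame,
    num_inference_steps=20,
    image_guidance_scale=1.5,
    guidance_scale=7.5,
).images[0]
future_frame.save("predicted_plus_100_frames.png")
```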
♻ ☆ Diffusion Transformer meets Multi-level Wavelet Spectrum for Single Image Super-Resolution ICCV 2025
Discrete Wavelet Transform (DWT) has been widely explored to enhance the
performance of image super-resolution (SR). Although some DWT-based methods
improve SR by capturing fine-grained frequency signals, most existing
approaches neglect the interrelations among multiscale frequency sub-bands,
resulting in inconsistencies and unnatural artifacts in the reconstructed
images. To address this challenge, we propose a Diffusion Transformer model
based on image Wavelet spectra for SR (DTWSR). DTWSR combines the strengths
of diffusion models and transformers to capture the interrelations among
multiscale frequency sub-bands, leading to more consistent and realistic SR
images. Specifically, we use a Multi-level Discrete Wavelet Transform to
decompose images into wavelet spectra. A pyramid tokenization method is
proposed that embeds the spectra into a sequence of tokens for the transformer
model, facilitating the capture of features from both the spatial and
frequency domains. A dual decoder is carefully designed to handle the distinct
variances in low-frequency and high-frequency sub-bands while preserving their
alignment in image generation. Extensive experiments on multiple benchmark
datasets demonstrate the effectiveness of our method, with high performance on
both perception quality and fidelity.
comment: ICCV 2025 Oral Paper
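A minimal sketch of the decomposition-and-tokenization step, assuming PyWavelets for the multi-level DWT; the wavelet choice, patch size, and coarse-to-fine ordering are illustrative rather than the DTWSR implementation.
```python
# Sketch: multi-level wavelet decomposition and a simple pyramid tokenization
# of the sub-bands into a flat token sequence.
import numpy as np
import pywt

def pyramid_tokenize(image: np.ndarray, levels: int = 3, patch: int = 4):
    """Decompose an image and flatten each sub-band into patch tokens."""
    coeffs = pywt.wavedec2(image, wavelet="haar", level=levels)
    bands = [coeffs[0]] + [b for triple in coeffs[1:] for b in triple]
    tokens = []
    for band in bands:                      # coarse-to-fine order
        h, w = band.shape
        h, w = h - h % patch, w - w % patch
        grid = band[:h, :w].reshape(h // patch, patch, w // patch, patch)
        tokens.append(grid.transpose(0, 2, 1, 3).reshape(-1, patch * patch))
    return np.concatenate(tokens, axis=0)   # (num_tokens, patch*patch)

img = np.random.rand(128, 128).astype(np.float32)
token_seq = pyramid_tokenize(img)
print(token_seq.shape)
```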
♻ ☆ Dual-level Progressive Hardness-Aware Reweighting for Cross-View Geo-Localization
Cross-view geo-localization (CVGL) between drone and satellite imagery
remains challenging due to severe viewpoint gaps and the presence of hard
negatives, which are visually similar but geographically mismatched samples.
Existing mining or reweighting strategies often use static weighting, which is
sensitive to distribution shifts and prone to overemphasizing difficult samples
too early, leading to noisy gradients and unstable convergence. In this paper,
we present a Dual-level Progressive Hardness-aware Reweighting (DPHR) strategy.
At the sample level, a Ratio-based Difficulty-Aware (RDA) module evaluates
relative difficulty and assigns fine-grained weights to negatives. At the batch
level, a Progressive Adaptive Loss Weighting (PALW) mechanism exploits a
training-progress signal to attenuate noisy gradients during early optimization
and progressively enhance hard-negative mining as training matures. Experiments
on the University-1652 and SUES-200 benchmarks demonstrate the effectiveness
and robustness of the proposed DPHR, achieving consistent improvements over
state-of-the-art methods.
comment: 5 pages, 3 figures
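The abstract does not give the RDA/PALW formulas, so the sketch below illustrates the general idea with assumed expressions: per-negative weights grow with a difficulty ratio, and a training-progress factor ramps hard-negative emphasis up over time.
```python
# Sketch of dual-level progressive reweighting for hard negatives (assumed formulas).
import torch

def dphr_weights(sim_pos, sim_neg, step, total_steps, gamma: float = 2.0):
    """sim_pos: (B,) positive similarities; sim_neg: (B, N) negative similarities."""
    # Sample level: negatives closer to the positive score get larger weights.
    ratio = (sim_neg / sim_pos.unsqueeze(1).clamp(min=1e-6)).clamp(0, 2)
    sample_w = ratio.pow(gamma)
    sample_w = sample_w / sample_w.sum(dim=1, keepdim=True)
    # Batch level: ramp up hard-negative emphasis as training progresses.
    progress = min(step / max(total_steps, 1), 1.0)
    return progress * sample_w + (1 - progress) / sim_neg.size(1)

sim_pos = torch.rand(8)
sim_neg = torch.rand(8, 32)
w = dphr_weights(sim_pos, sim_neg, step=1000, total_steps=10000)
loss = -(torch.log(torch.sigmoid(sim_pos.unsqueeze(1) - sim_neg)) * w).sum(dim=1).mean()
```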
♻ ☆ Talk2Event: Grounded Understanding of Dynamic Scenes from Event Cameras NeurIPS 2025
Lingdong Kong, Dongyue Lu, Ao Liang, Rong Li, Yuhao Dong, Tianshuai Hu, Lai Xing Ng, Wei Tsang Ooi, Benoit R. Cottereau
Event cameras offer microsecond-level latency and robustness to motion blur,
making them ideal for understanding dynamic environments. Yet, connecting these
asynchronous streams to human language remains an open challenge. We introduce
Talk2Event, the first large-scale benchmark for language-driven object
grounding in event-based perception. Built from real-world driving data, we
provide over 30,000 validated referring expressions, each enriched with four
grounding attributes -- appearance, status, relation to viewer, and relation to
other objects -- bridging spatial, temporal, and relational reasoning. To fully
exploit these cues, we propose EventRefer, an attribute-aware grounding
framework that dynamically fuses multi-attribute representations through a
Mixture of Event-Attribute Experts (MoEE). Our method adapts to different
modalities and scene dynamics, achieving consistent gains over state-of-the-art
baselines in event-only, frame-only, and event-frame fusion settings. We hope
our dataset and approach will establish a foundation for advancing multimodal,
temporally-aware, and language-driven perception in real-world robotics and
autonomy.
comment: NeurIPS 2025 Spotlight; 43 pages, 17 figures, 16 tables; Project Page
at https://talk2event.github.io
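To illustrate the attribute-aware fusion, here is a minimal mixture-of-experts sketch that gates four attribute embeddings (appearance, status, relation to viewer, relation to others) with the referring-expression feature; the gating network and expert design are assumptions, not the released EventRefer code.
```python
# Sketch: mixture-of-experts fusion over four grounding-attribute embeddings.
import torch
import torch.nn as nn

class AttributeMoE(nn.Module):
    def __init__(self, dim: int = 256, num_attrs: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_attrs))
        self.gate = nn.Linear(dim, num_attrs)

    def forward(self, attr_feats: torch.Tensor, text_feat: torch.Tensor):
        # attr_feats: (B, num_attrs, dim); text_feat: (B, dim) referring expression
        gates = torch.softmax(self.gate(text_feat), dim=-1)         # (B, num_attrs)
        expert_out = torch.stack(
            [e(attr_feats[:, i]) for i, e in enumerate(self.experts)], dim=1)
        return (gates.unsqueeze(-1) * expert_out).sum(dim=1)        # fused (B, dim)

fused = AttributeMoE()(torch.randn(2, 4, 256), torch.randn(2, 256))
```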
♻ ☆ FractalForensics: Proactive Deepfake Detection and Localization via Fractal Watermarks
Proactive Deepfake detection via robust watermarks has seen interest ever
since passive Deepfake detectors encountered challenges in identifying
high-quality synthetic images. However, while demonstrating reasonable
detection performance, they lack localization functionality and explainability
in detection results. Additionally, the unstable robustness of watermarks can
significantly affect the detection performance. In this study, we propose novel
fractal watermarks for proactive Deepfake detection and localization, namely
FractalForensics. Benefiting from the characteristics of fractals, we devise a
parameter-driven watermark generation pipeline that derives fractal-based
watermarks and performs one-way encryption of the selected parameters.
Subsequently, we propose a semi-fragile watermarking framework for watermark
embedding and recovery, trained to be robust against benign image processing
operations and fragile when facing Deepfake manipulations in a black-box
setting. Moreover, we introduce an entry-to-patch strategy that implicitly
embeds the watermark matrix entries into image patches at corresponding
positions, achieving localization of Deepfake manipulations. Extensive
experiments demonstrate satisfactory robustness and fragility of our approach
against common image processing operations and Deepfake manipulations,
outperforming state-of-the-art semi-fragile watermarking algorithms and passive
detectors for Deepfake detection. Furthermore, by highlighting the areas
manipulated, our method provides explainability for the proactive Deepfake
detection results.
comment: ACM Multimedia 2025 Oral
♻ ☆ Detection and Geographic Localization of Natural Objects in the Wild: A Case Study on Palms
Kangning Cui, Rongkun Zhu, Manqi Wang, Wei Tang, Gregory D. Larsen, Victor P. Pauca, Sarra Alqahtani, Fan Yang, David Segurado, David Lutz, Jean-Michel Morel, Miles R. Silman
Palms are ecological and economic indicators of tropical forest health,
biodiversity, and human impact, supporting local economies and global forest
product supply chains. While palm detection in plantations is well-studied,
efforts to map naturally occurring palms in dense forests remain limited by
overlapping crowns, uneven shading, and heterogeneous landscapes. We develop
PRISM (Processing, Inference, Segmentation, and Mapping), a flexible pipeline
for detecting and localizing palms in dense tropical forests using large
orthomosaic images. Orthomosaics are created from thousands of aerial images
and span several to hundreds of gigabytes. Our contributions are threefold.
First, we construct a large UAV-derived orthomosaic dataset collected across 21
ecologically diverse sites in western Ecuador, annotated with 8,830 bounding
boxes and 5,026 palm center points. Second, we evaluate multiple
state-of-the-art object detectors based on efficiency and performance,
integrating zero-shot SAM 2 as the segmentation backbone, and refining the
results for precise geographic mapping. Third, we apply calibration methods to
align confidence scores with IoU and explore saliency maps for feature
explainability. Though optimized for palms, PRISM is adaptable for identifying
other natural objects, such as eastern white pines. Future work will explore
transfer learning for lower-resolution datasets (0.5 to 1m).
comment: 15 pages, 8 figures, 4 tables
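One standard way to "align confidence scores with IoU", as mentioned above, is isotonic regression on validation detections; whether PRISM uses this particular calibrator is not stated, so the sketch below is only an illustrative choice using scikit-learn and synthetic placeholder values.
```python
# Sketch: calibrate detector confidence against observed IoU with isotonic regression.
import numpy as np
from sklearn.isotonic import IsotonicRegression

# (raw confidence, IoU with matched ground truth) from a held-out split;
# the values below are synthetic placeholders.
conf = np.random.rand(500)
iou = np.clip(conf + 0.1 * np.random.randn(500), 0, 1)

calibrator = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
calibrator.fit(conf, iou)

new_scores = np.array([0.35, 0.6, 0.9])
calibrated = calibrator.predict(new_scores)   # scores now track expected IoU
```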
♻ ☆ Towards Predicting Any Human Trajectory In Context NeurIPS 2025
Predicting accurate future trajectories of pedestrians is essential for
autonomous systems but remains a challenging task due to the need for
adaptability in different environments and domains. A common approach involves
collecting scenario-specific data and performing fine-tuning via
backpropagation. However, the need to fine-tune for each new scenario is often
impractical for deployment on edge devices. To address this challenge, we
introduce TrajICL, an In-Context Learning (ICL) framework for pedestrian
trajectory prediction that adapts to scenario-specific data at inference time
without fine-tuning or weight updates. We
propose a spatio-temporal similarity-based example selection (STES) method that
selects relevant examples from previously observed trajectories within the same
scene by identifying similar motion patterns at corresponding locations. To
further refine this selection, we introduce prediction-guided example selection
(PG-ES), which selects examples based on both the past trajectory and the
predicted future trajectory, rather than relying solely on the past trajectory.
This approach allows the model to account for long-term dynamics when selecting
examples. Finally, instead of relying on small real-world datasets with limited
scenario diversity, we train our model on a large-scale synthetic dataset to
enhance its prediction ability by leveraging in-context examples. Extensive
experiments demonstrate that TrajICL achieves remarkable adaptation across both
in-domain and cross-domain scenarios, outperforming even fine-tuned approaches
across multiple public benchmarks. Project Page:
https://fujiry0.github.io/TrajICL-project-page/.
comment: NeurIPS 2025
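A minimal sketch of spatio-temporal example selection in the spirit of STES: candidates are ranked by how closely their observed location and motion pattern match the query's. The specific distance (endpoint location plus velocity profile) is an assumption for illustration.
```python
# Sketch: pick in-context examples whose observed trajectories best match the query.
import numpy as np

def select_examples(query_past: np.ndarray, bank_past: np.ndarray, k: int = 4):
    """query_past: (T, 2); bank_past: (N, T, 2) observed trajectories."""
    loc_dist = np.linalg.norm(bank_past[:, -1] - query_past[-1], axis=-1)
    query_vel = np.diff(query_past, axis=0)            # (T-1, 2)
    bank_vel = np.diff(bank_past, axis=1)              # (N, T-1, 2)
    motion_dist = np.linalg.norm(bank_vel - query_vel, axis=-1).mean(axis=-1)
    score = loc_dist + motion_dist                     # lower is more similar
    return np.argsort(score)[:k]                       # indices of ICL examples

bank = np.random.rand(100, 8, 2)
query = np.random.rand(8, 2)
example_ids = select_examples(query, bank)
```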
♻ ☆ Real-Time Neural Video Compression with Unified Intra and Inter Coding
Neural video compression (NVC) technologies have advanced rapidly in recent
years, yielding state-of-the-art schemes such as DCVC-RT that offer superior
compression efficiency to H.266/VVC and real-time encoding/decoding
capabilities. Nonetheless, existing NVC schemes have several limitations,
including inefficiency in dealing with disocclusion and new content, interframe
error propagation and accumulation, among others. To eliminate these
limitations, we borrow the idea from classic video coding schemes, which allow
intra coding within inter-coded frames. With the intra coding tool enabled,
disocclusion and new content are properly handled, and interframe error
propagation is naturally intercepted without the need for manual refresh
mechanisms. We present an NVC framework with unified intra and inter coding,
where every frame is processed by a single model that is trained to perform
intra/inter coding adaptively. Moreover, we propose a simultaneous two-frame
compression design to exploit interframe redundancy not only forwardly but also
backwardly. Experimental results show that our scheme outperforms DCVC-RT by an
average of 12.1% BD-rate reduction, delivers more stable bitrate and quality
per frame, and retains real-time encoding/decoding performances. Code and
models will be released.
comment: 10 pages
♻ ☆ ID-Composer: Multi-Subject Video Synthesis with Hierarchical Identity Preservation
Panwang Pan, Jingjing Zhao, Yuchen Lin, Chenguo Lin, Chenxin Li, Haopeng Li, Honglei Yan, Tingting Shen, Yadong Mu
Video generative models pretrained on large-scale datasets can produce
high-quality videos, but are often conditioned on text or a single image,
limiting controllability and applicability. We introduce ID-Composer, a novel
framework that addresses this gap by tackling multi-subject video generation
from a text prompt and reference images. This task is challenging as it
requires preserving subject identities, integrating semantics across subjects
and modalities, and maintaining temporal consistency. To faithfully preserve
the subject consistency and textual information in synthesized videos,
ID-Composer designs a hierarchical identity-preserving attention mechanism,
which effectively aggregates features within and across subjects and
modalities. To enable the model to faithfully follow user intent, we introduce
semantic understanding via a pretrained vision-language model (VLM),
leveraging its superior semantic understanding to provide fine-grained
guidance and capture complex interactions between multiple subjects.
Considering that the standard diffusion loss often fails to align critical
concepts such as subject identity, we employ an online reinforcement learning
phase that casts the overall training objective of ID-Composer as
reinforcement learning with verifiable rewards (RLVR). Extensive
experiments demonstrate that our model surpasses existing methods in identity
preservation, temporal consistency, and video quality.
♻ ☆ Joint Lossless Compression and Steganography for Medical Images via Large Language Models
Pengcheng Zheng, Xiaorong Pu, Kecheng Chen, Jiaxin Huang, Meng Yang, Bai Feng, Yazhou Ren, Jianan Jiang, Chaoning Zhang, Yang Yang, Heng Tao Shen
Recently, large language models (LLMs) have driven promising progress in
lossless image compression. However, directly adopting existing paradigms for
medical images suffers from an unsatisfactory trade-off between compression
performance and efficiency. Moreover, existing LLM-based compressors often
overlook the security of the compression process, which is critical in modern
medical scenarios. To this end, we propose a novel joint lossless compression
and steganography framework. Inspired by bit plane slicing (BPS), we find it
feasible to securely embed privacy messages into medical images in an invisible
manner. Based on this insight, an adaptive modalities decomposition strategy is
first devised to partition the entire image into two segments, providing global
and local modalities for subsequent dual-path lossless compression. During this
dual-path stage, we innovatively propose a segmented message steganography
algorithm within the local modality path to ensure the security of the
compression process. Coupled with the proposed anatomical priors-based low-rank
adaptation (A-LoRA) fine-tuning strategy, extensive experimental results
demonstrate the superiority of our proposed method in terms of compression
ratios, efficiency, and security. The source code will be made publicly
available.
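To make the bit-plane intuition concrete, the sketch below embeds a message into the least significant bit plane of one image segment and recovers it; the paper's segmented steganography and modality decomposition are more involved, so this only illustrates the underlying BPS idea.
```python
# Sketch: LSB bit-plane embedding and extraction on one uint8 image segment.
import numpy as np

def embed_message(segment: np.ndarray, message_bits: np.ndarray) -> np.ndarray:
    """segment: uint8 image region; message_bits: flat array of 0/1 values."""
    flat = segment.flatten()
    n = min(len(message_bits), flat.size)
    flat[:n] = (flat[:n] & 0xFE) | message_bits[:n]    # overwrite LSB plane
    return flat.reshape(segment.shape)

def extract_message(segment: np.ndarray, n_bits: int) -> np.ndarray:
    return segment.flatten()[:n_bits] & 1

region = np.random.randint(0, 256, size=(64, 64), dtype=np.uint8)
bits = np.random.randint(0, 2, size=128).astype(np.uint8)
stego = embed_message(region.copy(), bits)
assert np.array_equal(extract_message(stego, 128), bits)
```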
♻ ☆ Learning with Category-Equivariant Architectures for Human Activity Recognition
We propose CatEquiv, a category-equivariant neural network for Human Activity
Recognition (HAR) from inertial sensors that systematically encodes temporal,
amplitude, and structural symmetries. We introduce a symmetry category that
jointly represents cyclic time shifts, positive gain scalings, and the
sensor-hierarchy poset, capturing the categorical symmetry structure of the
data. CatEquiv achieves equivariance with respect to the categorical symmetry
product. On UCI-HAR under out-of-distribution perturbations, CatEquiv attains
markedly higher robustness compared with circularly padded CNNs and plain CNNs.
These results demonstrate that enforcing categorical symmetries yields strong
invariance and generalization without additional model capacity.
♻ ☆ WXSOD: A Benchmark for Robust Salient Object Detection in Adverse Weather Conditions
Salient object detection (SOD) in complex environments remains a challenging
research topic. Most existing methods perform well in natural scenes with
negligible noise, and tend to leverage multi-modal information (e.g., depth and
infrared) to enhance accuracy. However, few studies examine how weather noise
degrades SOD performance, largely due to the lack of datasets with pixel-wise
annotations. To bridge this gap, this paper introduces a novel
Weather-eXtended Salient Object Detection (WXSOD) dataset. It consists of
14,945 RGB images with diverse weather noise, along with the corresponding
ground truth annotations and weather labels. To verify algorithm
generalization, WXSOD contains two test sets, i.e., a synthesized test set and
a real test set. The former is generated by adding weather noise to clean
images, while the latter contains real-world weather noise. Based on WXSOD, we
propose an efficient baseline, termed Weather-aware Feature Aggregation Network
(WFANet), which adopts a fully supervised two-branch architecture.
Specifically, the weather prediction branch mines weather-related deep
features, while the saliency detection branch fuses semantic features extracted
from the backbone with weather features for SOD. Comprehensive comparisons
against 17 SOD methods show that our WFANet achieves superior performance on
WXSOD. The code and benchmark results will be made publicly available at
https://github.com/C-water/WXSOD
comment: Under review
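A minimal two-branch sketch in the spirit of WFANet, with a weather-prediction branch whose pooled features are fused with backbone semantics for saliency; layer sizes and the concatenation-based fusion are assumptions.
```python
# Sketch: weather-prediction branch + saliency branch sharing a tiny backbone.
import torch
import torch.nn as nn

class TwoBranchSOD(nn.Module):
    def __init__(self, feat_dim: int = 64, num_weather: int = 5):
        super().__init__()
        self.backbone = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.weather_branch = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.weather_head = nn.Linear(feat_dim, num_weather)
        self.saliency_head = nn.Conv2d(feat_dim * 2, 1, 1)

    def forward(self, x):
        feat = torch.relu(self.backbone(x))                    # (B, C, H, W)
        wfeat = self.weather_branch(feat)                      # (B, C, 1, 1)
        weather_logits = self.weather_head(wfeat.flatten(1))   # weather label
        fused = torch.cat([feat, wfeat.expand_as(feat)], dim=1)
        saliency = torch.sigmoid(self.saliency_head(fused))    # (B, 1, H, W)
        return saliency, weather_logits

sal, weather = TwoBranchSOD()(torch.randn(2, 3, 128, 128))
```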
♻ ☆ CrossRay3D: Geometry and Distribution Guidance for Efficient Multimodal 3D Detection
Huiming Yang, Wenzhuo Liu, Yicheng Qiao, Lei Yang, Xianzhu Zeng, Li Wang, Zhiwei Li, Zijian Zeng, Zhiying Jiang, Huaping Liu, Kunfeng Wang
The sparse cross-modality detector offers more advantages than its
counterpart, the Bird's-Eye-View (BEV) detector, particularly in terms of
adaptability for downstream tasks and computational cost savings. However,
existing sparse detectors overlook the quality of token representation, leaving
them with sub-optimal foreground quality and limited performance. In this
paper, we identify that preserving geometric structure and balancing the class
distribution are key to improving sparse detector performance, and propose a
Sparse Selector (SS). The core modules of SS are Ray-Aware Supervision (RAS),
which preserves rich geometric information during the training stage, and
Class-Balanced Supervision, which adaptively reweights the salience of class
semantics, ensuring that tokens associated with small objects are retained
during token sampling; as a result, SS outperforms other sparse multi-modal
detectors in token representation. Additionally, we design
Ray Positional Encoding (Ray PE) to address the distribution differences
between the LiDAR modality and the image. Finally, we integrate the
aforementioned module into an end-to-end sparse multi-modality detector, dubbed
CrossRay3D. Experiments show that, on the challenging nuScenes benchmark,
CrossRay3D achieves state-of-the-art performance with 72.4 mAP and 74.7 NDS,
while running 1.84x faster than other leading methods. Moreover, CrossRay3D
demonstrates strong robustness even in scenarios where LiDAR or camera data are
partially or entirely missing.
comment: 13 pages
♻ ☆ DiffVLA++: Bridging Cognitive Reasoning and End-to-End Driving through Metric-Guided Alignment
Yu Gao, Anqing Jiang, Yiru Wang, Wang Jijun, Hao Jiang, Zhigang Sun, Heng Yuwen, Wang Shuo, Hao Zhao, Sun Hao
Conventional end-to-end (E2E) driving models are effective at generating
physically plausible trajectories, but often fail to generalize to long-tail
scenarios due to the lack of essential world knowledge to understand and reason
about surrounding environments. In contrast, Vision-Language-Action (VLA)
models leverage world knowledge to handle challenging cases, but their limited
3D reasoning capability can lead to physically infeasible actions. In this work
we introduce DiffVLA++, an enhanced autonomous driving framework that
explicitly bridges cognitive reasoning and E2E planning through metric-guided
alignment. First, we build a VLA module directly generating semantically
grounded driving trajectories. Second, we design an E2E module with a dense
trajectory vocabulary that ensures physical feasibility. Third, and most
critically, we introduce a metric-guided trajectory scorer that guides and
aligns the outputs of the VLA and E2E modules, thereby integrating their
complementary strengths. Experiments on the ICCV 2025 Autonomous Grand
Challenge leaderboard show that DiffVLA++ achieves an EPDMS of 49.12.
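As an illustration of metric-guided alignment, the sketch below scores candidate trajectories from both modules with simple stand-in metrics (collision margin and a comfort proxy) and returns the best one; the actual EPDMS-style scorer in DiffVLA++ is richer than this.
```python
# Sketch: score VLA and E2E trajectory proposals and pick the best-aligned one.
import numpy as np

def score_trajectory(traj: np.ndarray, obstacles: np.ndarray) -> float:
    """traj: (T, 2) future waypoints; obstacles: (M, 2) predicted positions."""
    dists = np.linalg.norm(traj[:, None, :] - obstacles[None, :, :], axis=-1)
    collision_margin = dists.min()                       # larger is safer
    accel = np.diff(traj, n=2, axis=0)                   # comfort proxy
    comfort = -np.linalg.norm(accel, axis=-1).mean()
    return collision_margin + 0.5 * comfort

def select_best(vla_trajs, e2e_trajs, obstacles):
    candidates = list(vla_trajs) + list(e2e_trajs)
    scores = [score_trajectory(t, obstacles) for t in candidates]
    return candidates[int(np.argmax(scores))]

best = select_best(np.random.rand(3, 8, 2), np.random.rand(5, 8, 2),
                   np.random.rand(4, 2))
```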
♻ ☆ 3DBonsai: Structure-Aware Bonsai Modeling Using Conditioned 3D Gaussian Splatting
Recent advancements in text-to-3D generation have shown remarkable results by
leveraging 3D priors in combination with 2D diffusion. However, previous
methods utilize 3D priors that lack detailed and complex structural
information, limiting them to generating simple objects and presenting
challenges for creating intricate structures such as bonsai. In this paper, we
propose 3DBonsai, a novel text-to-3D framework for generating 3D bonsai with
complex structures. Technically, we first design a trainable 3D space
colonization algorithm to produce bonsai structures, which are then enhanced
through random sampling and point cloud augmentation to serve as the 3D
Gaussian priors. We introduce two bonsai generation pipelines with distinct
structural levels: fine structure conditioned generation, which initializes 3D
Gaussians using a 3D structure prior to produce detailed and complex bonsai,
and coarse structure conditioned generation, which employs a multi-view
structure consistency module to align 2D and 3D structures. Moreover, we have
compiled a unified 2D and 3D Chinese-style bonsai dataset. Our experimental
results demonstrate that 3DBonsai significantly outperforms existing methods,
providing a new benchmark for structure-aware 3D bonsai generation.
♻ ☆ DRIP: Dynamic patch Reduction via Interpretable Pooling
Recently, the advances in vision-language models, including contrastive
pretraining and instruction tuning, have greatly pushed the frontier of
multimodal AI. However, owing to the large-scale and hence expensive
pretraining, the efficiency concern has discouraged researchers from attempting
to pretrain a vision language model from scratch. In this work, we propose
Dynamic patch Reduction via Interpretable Pooling (DRIP), which adapts to the
input images and dynamically merges tokens in the deeper layers of a visual
encoder. Our results on both ImageNet training from scratch and CLIP
contrastive pretraining demonstrate a significant GFLOP reduction while
maintaining comparable classification/zero-shot performance. To further
validate our proposed method, we conduct continual pretraining on a large
biology dataset, extending its impact into scientific domains.
comment: Need more refinement
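The sketch below shows the flavor of dynamic patch reduction: in a deeper layer, consecutive tokens whose embeddings are highly similar are average-pooled into one token. The greedy pairing rule and threshold are fixed heuristics here, whereas DRIP's interpretable pooling is learned.
```python
# Sketch: greedily merge highly similar consecutive ViT tokens to cut GFLOPs.
import torch
import torch.nn.functional as F

def merge_similar_tokens(tokens: torch.Tensor, threshold: float = 0.9):
    """tokens: (N, D). Merge consecutive pairs whose cosine similarity is high."""
    merged, i = [], 0
    while i < tokens.size(0):
        if i + 1 < tokens.size(0) and F.cosine_similarity(
                tokens[i], tokens[i + 1], dim=0) > threshold:
            merged.append((tokens[i] + tokens[i + 1]) / 2)   # pool the pair
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return torch.stack(merged)

x = torch.randn(196, 768)          # ViT patch tokens at a deeper layer
reduced = merge_similar_tokens(x)  # fewer tokens -> cheaper downstream layers
```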
♻ ☆ Weakly Supervised Object Segmentation by Background Conditional Divergence
As a computer vision task, automatic object segmentation remains challenging
in specialized image domains without massive labeled data, such as synthetic
aperture sonar images, remote sensing, biomedical imaging, etc. In any domain,
obtaining pixel-wise segmentation masks is expensive. In this work, we propose
a method for training a masking network to perform binary object segmentation
using weak supervision in the form of image-wise presence or absence of an
object of interest, which provides less information but may be obtained more
quickly from manual or automatic labeling. A key step in our method is that the
segmented objects can be placed into background-only images to create realistic
images of the objects with counterfactual backgrounds. To create a contrast
between the original and counterfactual background images, we propose to first
cluster the background-only images and then, during learning, create
counterfactual images that blend objects segmented from their original source
backgrounds to backgrounds chosen from a targeted cluster. One term in the
training loss is the divergence between these counterfactual images and the
real object images with backgrounds of the target cluster. The other term is a
supervised loss for background-only images. While an adversarial critic could
provide the divergence, we use sample-based divergences. We conduct experiments
on side-scan and synthetic aperture sonar in which our approach succeeds
compared to previous unsupervised segmentation baselines that were only tested
on natural images. Furthermore, to show generality we extend our experiments to
natural images, obtaining reasonable performance with our method that avoids
pretrained networks, generative networks, and adversarial critics. The code for
this work can be found at https://github.com/bakerhassan/WSOS.
comment: Published in TMLR: https://openreview.net/forum?id=2JJZhfGvMW
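A small sketch of the counterfactual step described above: a predicted mask composites the object onto a background sampled from a target cluster, and a sample-based divergence (an RBF-kernel MMD here, chosen for illustration) compares the composites with real object images from that cluster.
```python
# Sketch: counterfactual compositing plus a sample-based divergence (RBF MMD).
import torch

def composite(obj_img, mask, background):
    """obj_img, background: (B, 3, H, W); mask: (B, 1, H, W) in [0, 1]."""
    return mask * obj_img + (1 - mask) * background

def rbf_mmd(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    x, y = x.flatten(1), y.flatten(1)
    def k(a, b):
        d = torch.cdist(a, b).pow(2)
        return torch.exp(-d / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

obj = torch.rand(4, 3, 64, 64)
mask = torch.rand(4, 1, 64, 64)          # output of the masking network
bg = torch.rand(4, 3, 64, 64)            # sampled from a target background cluster
real = torch.rand(4, 3, 64, 64)          # real object images on that cluster's scenes
loss = rbf_mmd(composite(obj, mask, bg), real)
```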
♻ ☆ Latent Zoning Network: A Unified Principle for Generative Modeling, Representation Learning, and Classification NeurIPS 2025
Generative modeling, representation learning, and classification are three
core problems in machine learning (ML), yet their state-of-the-art (SoTA)
solutions remain largely disjoint. In this paper, we ask: Can a unified
principle address all three? Such unification could simplify ML pipelines and
foster greater synergy across tasks. We introduce Latent Zoning Network (LZN)
as a step toward this goal. At its core, LZN creates a shared Gaussian latent
space that encodes information across all tasks. Each data type (e.g., images,
text, labels) is equipped with an encoder that maps samples to disjoint latent
zones, and a decoder that maps latents back to data. ML tasks are expressed as
compositions of these encoders and decoders: for example, label-conditional
image generation uses a label encoder and image decoder; image embedding uses
an image encoder; classification uses an image encoder and label decoder. We
demonstrate the promise of LZN in three increasingly complex scenarios: (1) LZN
can enhance existing models (image generation): When combined with the SoTA
Rectified Flow model, LZN improves FID on CIFAR10 from 2.76 to 2.59-without
modifying the training objective. (2) LZN can solve tasks independently
(representation learning): LZN can implement unsupervised representation
learning without auxiliary loss functions, outperforming the seminal MoCo and
SimCLR methods by 9.3% and 0.2%, respectively, on downstream linear
classification on ImageNet. (3) LZN can solve multiple tasks simultaneously
(joint generation and classification): With image and label encoders/decoders,
LZN performs both tasks jointly by design, improving FID and achieving SoTA
classification accuracy on CIFAR10. The code and trained models are available
at https://github.com/microsoft/latent-zoning-networks. The project website is
at https://zinanlin.me/blogs/latent_zoning_networks.html.
comment: Published in NeurIPS 2025
♻ ☆ Visual Program Distillation with Template-Based Augmentation EMNLP
Adapting visual programming or prompting large language models (LLMs) to
generate executable code for visual tasks like visual question answering (VQA)
in specialized tasks or domains remains challenging due to high annotation and
inference costs. We propose a low-cost visual program distillation method that
can be used for models with at most 1 billion parameters and requires no
human-generated program annotations. We achieve this through synthetic data
augmentation based on decoupling programs into higher-level skills, called
templates, and their corresponding arguments. Experimental results show that,
with a relatively small amount of question/answer data, small language models
can generate high-quality specialized visual programs with the added benefit of
much faster inference.
comment: EMNLP Camera Ready
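As a toy illustration of decoupling programs into templates and arguments, the sketch below turns one annotated VQA program into a template and enumerates synthetic (question, program) pairs by swapping arguments; the program syntax and helper names are hypothetical, not the paper's actual DSL.
```python
# Sketch: template-based augmentation for visual program distillation.
from itertools import product

program = 'count(filter_color(detect("car"), "red"))'          # original annotated program
template = 'count(filter_color(detect("{obj}"), "{color}"))'   # higher-level skill

objects = ["car", "bus", "person"]
colors = ["red", "blue", "yellow"]

synthetic_pairs = [
    (f"How many {color} {obj}s are there?",
     template.format(obj=obj, color=color))
    for obj, color in product(objects, colors)
]
# These pairs can then supervise a <=1B-parameter student to emit programs.
print(len(synthetic_pairs), synthetic_pairs[0])
```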