Computer Vision and Pattern Recognition 150
☆ LongSplat: Robust Unposed 3D Gaussian Splatting for Casual Long Videos ICCV 2025
LongSplat addresses critical challenges in novel view synthesis (NVS) from
casually captured long videos characterized by irregular camera motion, unknown
camera poses, and expansive scenes. Current methods often suffer from pose
drift, inaccurate geometry initialization, and severe memory limitations. To
address these issues, we introduce LongSplat, a robust unposed 3D Gaussian
Splatting framework featuring: (1) Incremental Joint Optimization that
concurrently optimizes camera poses and 3D Gaussians to avoid local minima and
ensure global consistency; (2) a robust Pose Estimation Module leveraging
learned 3D priors; and (3) an efficient Octree Anchor Formation mechanism that
converts dense point clouds into anchors based on spatial density. Extensive
experiments on challenging benchmarks demonstrate that LongSplat achieves
state-of-the-art results, substantially improving rendering quality, pose
accuracy, and computational efficiency compared to prior approaches. Project
page: https://linjohnss.github.io/longsplat/
comment: ICCV 2025. Project page: https://linjohnss.github.io/longsplat/
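To make the Octree Anchor Formation idea from the LongSplat abstract concrete, the
following is a minimal hedged sketch, not the paper's implementation: points are
bucketed into voxels at a fixed octree depth, and each sufficiently dense voxel
contributes one anchor at its centroid. The depth, the density threshold, and the
centroid rule are illustrative assumptions.

```python
# Hypothetical density-based anchor formation (illustrative only): bucket points
# into voxels at a fixed octree depth; dense voxels become anchors at their centroid.
import numpy as np

def octree_anchors(points, depth=4, min_points=8):
    """points: (N, 3) array; returns (M, 3) anchor positions."""
    lo, hi = points.min(axis=0), points.max(axis=0)
    res = 2 ** depth                                   # voxels per axis at this octree depth
    cell = (hi - lo) / res + 1e-9                      # epsilon avoids division by zero
    idx = np.clip(((points - lo) / cell).astype(int), 0, res - 1)
    keys = idx[:, 0] * res * res + idx[:, 1] * res + idx[:, 2]
    anchors = []
    for key in np.unique(keys):
        members = points[keys == key]
        if len(members) >= min_points:                 # spatial-density test
            anchors.append(members.mean(axis=0))       # one anchor per dense voxel
    return np.asarray(anchors)

pts = np.random.rand(10000, 3)                         # stand-in for a dense point cloud
print(octree_anchors(pts, depth=3).shape)
```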
☆ Beyond Simple Edits: Composed Video Retrieval with Dense Modifications ICCV-2025
Omkar Thawakar, Dmitry Demidov, Ritesh Thawkar, Rao Muhammad Anwer, Mubarak Shah, Fahad Shahbaz Khan, Salman Khan
Composed video retrieval is a challenging task that strives to retrieve a
target video based on a query video and a textual description detailing
specific modifications. Standard retrieval frameworks typically struggle to
handle the complexity of fine-grained compositional queries and variations in
temporal understanding limiting their retrieval ability in the fine-grained
setting. To address this issue, we introduce a novel dataset that captures both
fine-grained and composed actions across diverse video segments, enabling more
detailed compositional changes in retrieved video content. The proposed
dataset, named Dense-WebVid-CoVR, consists of 1.6 million samples with dense
modification text that is around seven times longer than that of its existing
counterpart. We further develop a new model that integrates visual and textual
information through Cross-Attention (CA) fusion using a grounded text encoder,
enabling precise alignment between dense query modifications and target videos.
The proposed model achieves state-of-the-art results, surpassing existing
methods on all metrics. Notably, it achieves 71.3% Recall@1 in the visual+text
setting and outperforms the previous state-of-the-art by 3.4%, highlighting its
efficacy in leveraging detailed video descriptions and dense modification
texts. Our proposed dataset, code, and model are available at:
https://github.com/OmkarThawakar/BSE-CoVR
comment: Accepted to ICCV-2025
☆ Distilled-3DGS: Distilled 3D Gaussian Splatting
3D Gaussian Splatting (3DGS) has exhibited remarkable efficacy in novel view
synthesis (NVS). However, it suffers from a significant drawback: achieving
high-fidelity rendering typically necessitates a large number of 3D Gaussians,
resulting in substantial memory consumption and storage requirements. To
address this challenge, we propose the first knowledge distillation framework
for 3DGS, featuring various teacher models, including vanilla 3DGS,
noise-augmented variants, and dropout-regularized versions. The outputs of
these teachers are aggregated to guide the optimization of a lightweight
student model. To distill the hidden geometric structure, we propose a
structural similarity loss to boost the consistency of spatial geometric
distributions between the student and teacher models. Through comprehensive
quantitative and qualitative evaluations across diverse datasets, the proposed
Distilled-3DGS, a simple yet effective framework without bells and whistles,
achieves promising results in both rendering quality and storage
efficiency compared to state-of-the-art methods. Project page:
https://distilled3dgs.github.io . Code:
https://github.com/lt-xiang/Distilled-3DGS .
comment: Project page: https://distilled3dgs.github.io Code:
https://github.com/lt-xiang/Distilled-3DGS
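As a rough illustration of the multi-teacher setup described in the Distilled-3DGS
abstract, the sketch below averages the renderings of several teachers and uses the
result as an extra supervision target for the student. It is a hedged stand-in: the
paper's structural similarity loss over spatial geometric distributions is not
reproduced here, and a plain L1 term against the averaged teachers is used instead.

```python
# Hedged multi-teacher distillation sketch (not the authors' loss): the student
# render is supervised by the ground-truth image and by the averaged teacher renders.
import torch
import torch.nn.functional as F

def distillation_loss(student_render, teacher_renders, gt_image, w_distill=0.5):
    """student_render, gt_image: (3, H, W); teacher_renders: list of (3, H, W)."""
    teacher_mean = torch.stack(teacher_renders).mean(dim=0)       # aggregate teachers
    loss_gt = F.l1_loss(student_render, gt_image)                 # photometric term
    loss_kd = F.l1_loss(student_render, teacher_mean.detach())    # distillation term
    return loss_gt + w_distill * loss_kd

student = torch.rand(3, 64, 64, requires_grad=True)
teachers = [torch.rand(3, 64, 64) for _ in range(3)]              # e.g. vanilla/noise/dropout teachers
gt = torch.rand(3, 64, 64)
print(distillation_loss(student, teachers, gt).item())
```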
☆ GeoSAM2: Unleashing the Power of SAM2 for 3D Part Segmentation
Modern 3D generation methods can rapidly create shapes from sparse or single
views, but their outputs often lack geometric detail due to computational
constraints. We present DetailGen3D, a generative approach specifically
designed to enhance these generated 3D shapes. Our key insight is to model the
coarse-to-fine transformation directly through data-dependent flows in latent
space, avoiding the computational overhead of large-scale 3D generative models.
We introduce a token matching strategy that ensures accurate spatial
correspondence during refinement, enabling local detail synthesis while
preserving global structure. By carefully designing our training data to match
the characteristics of synthesized coarse shapes, our method can effectively
enhance shapes produced by various 3D generation and reconstruction approaches,
from single-view to sparse multi-view inputs. Extensive experiments demonstrate
that DetailGen3D achieves high-fidelity geometric detail synthesis while
maintaining efficiency in training.
comment: https://detailgen3d.github.io/GeoSAM2/
☆ InfiniteTalk: Audio-driven Video Generation for Sparse-Frame Video Dubbing
Shaoshu Yang, Zhe Kong, Feng Gao, Meng Cheng, Xiangyu Liu, Yong Zhang, Zhuoliang Kang, Wenhan Luo, Xunliang Cai, Ran He, Xiaoming Wei
Recent breakthroughs in video AIGC have ushered in a transformative era for
audio-driven human animation. However, conventional video dubbing techniques
remain constrained to mouth region editing, resulting in discordant facial
expressions and body gestures that compromise viewer immersion. To overcome
this limitation, we introduce sparse-frame video dubbing, a novel paradigm that
strategically preserves reference keyframes to maintain identity, iconic
gestures, and camera trajectories while enabling holistic, audio-synchronized
full-body motion editing. Through critical analysis, we identify why naive
image-to-video models fail in this task, particularly their inability to
achieve adaptive conditioning. Addressing this, we propose InfiniteTalk, a
streaming audio-driven generator designed for infinite-length sequence
dubbing. This architecture leverages temporal context frames for seamless
inter-chunk transitions and incorporates a simple yet effective sampling
strategy that optimizes control strength via fine-grained reference frame
positioning. Comprehensive evaluations on HDTF, CelebV-HQ, and EMTD datasets
demonstrate state-of-the-art performance. Quantitative metrics confirm superior
visual realism, emotional coherence, and full-body motion synchronization.
comment: 11 pages, 7 figures
☆ UNICON: UNIfied CONtinual Learning for Medical Foundational Models
Foundational models are trained on extensive datasets to capture the general
trends of a domain. However, in medical imaging, the scarcity of data makes
pre-training for every domain, modality, or task challenging. Continual
learning offers a solution by fine-tuning a model sequentially on different
domains or tasks, enabling it to integrate new knowledge without requiring
large datasets for each training phase. In this paper, we propose UNIfied
CONtinual Learning for Medical Foundational Models (UNICON), a framework that
enables the seamless adaptation of foundation models to diverse domains, tasks,
and modalities. Unlike conventional adaptation methods that treat these changes
in isolation, UNICON provides a unified, perpetually expandable framework.
Through careful integration, we show that foundation models can dynamically
expand across imaging modalities, anatomical regions, and clinical objectives
without catastrophic forgetting or task interference. Empirically, we validate
our approach by adapting a chest CT foundation model initially trained for
classification to prognosis and segmentation tasks. Our results show improved
performance across both additional tasks. Furthermore, we continually
incorporated PET scans and achieved a 5% improvement in Dice score compared to
respective baselines. These findings establish that foundation models are not
inherently constrained to their initial training scope but can evolve, paving
the way toward generalist AI models for medical imaging.
comment: 10 pages, 1 figure
☆ Backdooring Self-Supervised Contrastive Learning by Noisy Alignment ICCV 2025
Self-supervised contrastive learning (CL) effectively learns transferable
representations from unlabeled data containing images or image-text pairs but
is vulnerable to data poisoning backdoor attacks (DPCLs). An adversary
can inject poisoned images into pretraining datasets, causing compromised CL
encoders to exhibit targeted misbehavior in downstream tasks. Existing DPCLs,
however, achieve limited efficacy due to their dependence on the fragile
implicit co-occurrence between the backdoor trigger and the target object, and
their inadequate suppression of discriminative features in backdoored images.
We propose Noisy Alignment (NA),
a DPCL method that explicitly suppresses noise components in poisoned images.
Inspired by powerful training-controllable CL attacks, we identify and extract
the critical objective of noisy alignment, adapting it effectively into
data-poisoning scenarios. Our method implements noisy alignment by
strategically manipulating contrastive learning's random cropping mechanism,
formulating this process as an image layout optimization problem with
theoretically derived optimal parameters. The resulting method is simple yet
effective, achieving state-of-the-art performance compared to existing DPCLs,
while maintaining clean-data accuracy. Furthermore, Noisy Alignment
demonstrates robustness against common backdoor defenses. Codes can be found at
https://github.com/jsrdcht/Noisy-Alignment.
comment: Accepted by ICCV 2025
☆ Online 3D Gaussian Splatting Modeling with Novel View Selection
This study addresses the challenge of generating online 3D Gaussian Splatting
(3DGS) models from RGB-only frames. Previous studies have employed dense SLAM
techniques to estimate 3D scenes from keyframes for 3DGS model construction.
However, these methods are limited by their reliance solely on keyframes, which
are insufficient to capture an entire scene, resulting in incomplete
reconstructions. Moreover, building a generalizable model requires
incorporating frames from diverse viewpoints to achieve broader scene coverage.
However, online processing restricts the use of many frames or extensive
training iterations. Therefore, we propose a novel method for high-quality 3DGS
modeling that improves model completeness through adaptive view selection. By
analyzing reconstruction quality online, our approach selects optimal
non-keyframes for additional training. By integrating both keyframes and
selected non-keyframes, the method refines incomplete regions from diverse
viewpoints, significantly enhancing completeness. We also present a framework
that incorporates an online multi-view stereo approach, ensuring consistency in
3D information throughout the 3DGS modeling process. Experimental results
demonstrate that our method outperforms state-of-the-art methods, delivering
exceptional performance in complex outdoor scenes.
☆ ResPlan: A Large-Scale Vector-Graph Dataset of 17,000 Residential Floor Plans
We introduce ResPlan, a large-scale dataset of 17,000 detailed, structurally
rich, and realistic residential floor plans, created to advance spatial AI
research. Each plan includes precise annotations of architectural elements
(walls, doors, windows, balconies) and functional spaces (such as kitchens,
bedrooms, and bathrooms). ResPlan addresses key limitations of existing
datasets such as RPLAN (Wu et al., 2019) and MSD (van Engelenburg et al., 2024)
by offering enhanced visual fidelity and greater structural diversity,
reflecting realistic and non-idealized residential layouts. Designed as a
versatile, general-purpose resource, ResPlan supports a wide range of
applications including robotics, reinforcement learning, generative AI, virtual
and augmented reality, simulations, and game development. Plans are provided in
both geometric and graph-based formats, enabling direct integration into
simulation engines and fast 3D conversion. A key contribution is an open-source
pipeline for geometry cleaning, alignment, and annotation refinement.
Additionally, ResPlan includes structured representations of room connectivity,
supporting graph-based spatial reasoning tasks. Finally, we present comparative
analyses with existing benchmarks and outline several open benchmark tasks
enabled by ResPlan. Ultimately, ResPlan offers a significant advance in scale,
realism, and usability, providing a robust foundation for developing and
benchmarking next-generation spatial intelligence systems.
comment: 18 pages, 3 figures, 4 tables
☆ Self-Supervised Sparse Sensor Fusion for Long Range Perception
Outside of urban hubs, autonomous cars and trucks have to master driving on
intercity highways. Safe, long-distance highway travel at speeds exceeding 100
km/h demands perception distances of at least 250 m, which is about five times
the 50-100m typically addressed in city driving, to allow sufficient planning
and braking margins. Increasing the perception range also makes it possible to
extend autonomy from light two-ton passenger vehicles to large forty-ton trucks,
which need a longer planning horizon due to their high inertia. However, most
existing perception approaches focus on shorter ranges and rely on Bird's Eye
View (BEV) representations, which incur quadratic increases in memory and
compute costs as distance grows. To overcome this limitation, we build on top
of a sparse representation and introduce an efficient 3D encoding of
multi-modal and temporal features, along with a novel self-supervised
pre-training scheme that enables large-scale learning from unlabeled
camera-LiDAR data. Our approach extends perception distances to 250 meters and
achieves a 26.6% improvement in object detection mAP and a 30.5% decrease in
Chamfer Distance for LiDAR forecasting compared to existing methods. Project Page:
https://light.princeton.edu/lrs4fusion/
☆ Physics-Based 3D Simulation for Synthetic Data Generation and Failure Analysis in Packaging Stability Assessment
The design and analysis of pallet setups are essential for ensuring the safety
of package transportation. With rising demands in the logistics sector, the
development of automated systems utilizing advanced technologies has become
increasingly crucial. Moreover, the widespread use of plastic wrapping has
motivated researchers to investigate eco-friendly alternatives that still
adhere to safety standards. We present a fully controllable and accurate
physical simulation system capable of replicating the behavior of moving
pallets. It features a 3D graphics-based virtual environment that supports a
wide range of configurations, including variable package layouts, different
wrapping materials, and diverse dynamic conditions. This innovative approach
reduces the need for physical testing, cutting costs and environmental impact
while improving measurement accuracy for analyzing pallet dynamics.
Additionally, we train a deep neural network to evaluate the rendered videos
generated by our simulator, as a crash-test predictor for pallet
configurations, further enhancing the system's utility in safety analysis.
☆ OmViD: Omni-supervised active learning for video action detection ICCV
Video action detection requires dense spatio-temporal annotations, which are
both challenging and expensive to obtain. However, real-world videos often vary
in difficulty and may not require the same level of annotation. This paper
analyzes the appropriate annotation types for each sample and their impact on
spatio-temporal video action detection. It focuses on two key aspects: 1) how
to obtain varying levels of annotation for videos, and 2) how to learn action
detection from different annotation types. The study explores video-level tags,
points, scribbles, bounding boxes, and pixel-level masks. First, a simple
active learning strategy is proposed to estimate the necessary annotation type
for each video. Then, a novel spatio-temporal 3D-superpixel approach is
introduced to generate pseudo-labels from these annotations, enabling effective
training. The approach is validated on UCF101-24 and JHMDB-21 datasets,
significantly cutting annotation costs with minimal performance loss.
comment: ICCVW'25
☆ ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving
Xianda Guo, Ruijun Zhang, Yiqun Duan, Ruilin Wang, Keyuan Zhou, Wenzhao Zheng, Wenke Huang, Gangwei Xu, Mike Horton, Yuan Si, Hao Zhao, Long Chen
Depth estimation is a fundamental task for 3D scene understanding in
autonomous driving, robotics, and augmented reality. Existing depth datasets,
such as KITTI, nuScenes, and DDAD, have advanced the field but suffer from
limitations in diversity and scalability. As benchmark performance on these
datasets approaches saturation, there is an increasing need for a new
generation of large-scale, diverse, and cost-efficient datasets to support the
era of foundation models and multi-modal learning. To address these challenges,
we introduce a large-scale, diverse, frame-wise continuous dataset for depth
estimation in dynamic outdoor driving environments, comprising 20K video frames
to evaluate existing methods. Our lightweight acquisition pipeline ensures
broad scene coverage at low cost, while sparse yet statistically sufficient
ground truth enables robust training. Compared to existing datasets, ours
presents greater diversity in driving scenarios and lower depth density,
creating new challenges for generalization. Benchmark experiments with standard
monocular depth estimation models validate the dataset's utility and highlight
substantial performance gaps in challenging conditions, establishing a new
platform for advancing depth estimation research.
☆ RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation
We investigate to what extent Multimodal Large Language Models (MLLMs) can
accurately identify the orientation of input images rotated 0°, 90°,
180°, and 270°. This task demands robust visual reasoning
capabilities to detect rotational cues and contextualize spatial relationships
within images, regardless of their orientation. To evaluate MLLMs on these
abilities, we introduce RotBench -- a 350-image manually-filtered benchmark
comprising lifestyle, portrait, and landscape images. Despite the relatively
simple nature of this task, we show that several state-of-the-art open and
proprietary MLLMs, including GPT-5, o3, and Gemini-2.5-Pro, do not reliably
identify rotation in input images. Providing models with auxiliary information
-- including captions, depth maps, and more -- or using chain-of-thought
prompting offers only small and inconsistent improvements. Our results indicate
that most models are able to reliably identify right-side-up (0°) images,
while certain models are able to identify upside-down (180°) images. None
can reliably distinguish between 90° and 270°. Simultaneously showing
the image rotated in different orientations leads to moderate performance gains
for reasoning models, while a modified setup using voting improves the
performance of weaker models. We further show that fine-tuning does not improve
models' ability to distinguish 90° and 270° rotations, despite
substantially improving the identification of 180° images. Together, these
results reveal a significant gap between MLLMs' spatial reasoning capabilities
and human perception in identifying rotation.
comment: 20 pages. Code and data: https://github.com/tianyiniu/RotBench
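For the RotBench setup above, a hedged sketch of the input preparation follows:
the four rotated copies are generated with PIL and repeated model predictions are
aggregated by a simple majority vote. The model call itself is omitted because the
benchmark's model interface is not specified here; the angle list in the demo is a
placeholder.

```python
# Illustrative preparation of rotated copies plus a majority-vote aggregator.
from collections import Counter
from PIL import Image

ANGLES = [0, 90, 180, 270]

def rotated_copies(img):
    # PIL's Image.rotate is counter-clockwise; expand=True keeps the full canvas.
    return {a: img.rotate(a, expand=True) for a in ANGLES}

def vote(predicted_angles):
    """Aggregate repeated model predictions by majority vote."""
    return Counter(predicted_angles).most_common(1)[0][0]

demo = Image.new("RGB", (64, 48))
print({a: im.size for a, im in rotated_copies(demo).items()})  # 90/270 swap W and H
print(vote([90, 270, 90]))                                     # placeholder predictions -> 90
```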
☆ Augmenting cobots for sheet-metal SMEs with 3D object recognition and localisation
Due to high-mix-low-volume production, sheet-metal workshops today are
challenged by small series and varying orders. As standard automation solutions
tend to fall short, SMEs resort to repetitive manual labour impacting
production costs and leading to tech-skilled workforces not being used to their
full potential. The COOCK+ ROBUST project aims to transform cobots into mobile
and reconfigurable production assistants by integrating existing technologies,
including 3D object recognition and localisation. This article explores both
the opportunities and challenges of enhancing cobotic systems with these
technologies in an industrial setting, outlining the key steps involved in the
process. Additionally, insights from a past project, carried out by the ACRO
research unit in collaboration with an industrial partner, serve as a concrete
implementation example throughout.
comment: 13 pages, 25 figures
☆ ViT-FIQA: Assessing Face Image Quality using Vision Transformers ICCV
Face Image Quality Assessment (FIQA) aims to predict the utility of a face
image for face recognition (FR) systems. State-of-the-art FIQA methods mainly
rely on convolutional neural networks (CNNs), leaving the potential of Vision
Transformer (ViT) architectures underexplored. This work proposes ViT-FIQA, a
novel approach that extends standard ViT backbones, originally optimized for
FR, through a learnable quality token designed to predict a scalar utility
score for any given face image. The learnable quality token is concatenated
with the standard image patch tokens, and the whole sequence is processed via
global self-attention by the ViT encoders to aggregate contextual information
across all patches. At the output of the backbone, ViT-FIQA branches into two
heads: (1) the patch tokens are passed through a fully connected layer to learn
discriminative face representations via a margin-penalty softmax loss, and (2)
the quality token is fed into a regression head to learn to predict the face
sample's utility. Extensive experiments on challenging benchmarks and several
FR models, including both CNN- and ViT-based architectures, demonstrate that
ViT-FIQA consistently achieves top-tier performance. These results underscore
the effectiveness of transformer-based architectures in modeling face image
utility and highlight the potential of ViTs as a scalable foundation for future
FIQA research: https://cutt.ly/irHlzXUC.
comment: Accepted at the IEEE/CVF International Conference on Computer Vision
Workshops 2025 (ICCVW 2025)
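A minimal PyTorch sketch of the learnable-quality-token idea in ViT-FIQA follows.
The dimensions, the mean-pooling of patch tokens before the identity head, and the
plain linear heads are assumptions for illustration, not the paper's exact
architecture (which builds on an FR-optimized backbone with a margin-penalty
softmax loss).

```python
# Hedged sketch: one learnable quality token is appended to the patch sequence,
# processed by self-attention, and regressed to a scalar utility score.
import torch
import torch.nn as nn

class QualityTokenViT(nn.Module):
    def __init__(self, d_model=256, nhead=8, depth=4, num_ids=1000):
        super().__init__()
        self.quality_token = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.id_head = nn.Linear(d_model, num_ids)     # identity logits (margin loss not shown)
        self.quality_head = nn.Linear(d_model, 1)      # scalar utility score

    def forward(self, patch_tokens):                   # patch_tokens: (B, N, d_model)
        b = patch_tokens.size(0)
        q = self.quality_token.expand(b, -1, -1)
        x = self.encoder(torch.cat([q, patch_tokens], dim=1))
        quality = self.quality_head(x[:, 0])           # prediction from the quality token
        identity = self.id_head(x[:, 1:].mean(dim=1))  # pooled patch tokens -> identity branch
        return identity, quality

model = QualityTokenViT()
ids, q = model(torch.randn(2, 196, 256))
print(ids.shape, q.shape)
```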
☆ Real-Time, Population-Based Reconstruction of 3D Bone Models via Very-Low-Dose Protocols
Yiqun Lin, Haoran Sun, Yongqing Li, Rabia Aslam, Lung Fung Tse, Tiange Cheng, Chun Sing Chui, Wing Fung Yau, Victorine R. Le Meur, Meruyert Amangeldy, Kiho Cho, Yinyu Ye, James Zou, Wei Zhao, Xiaomeng Li
Patient-specific bone models are essential for designing surgical guides and
preoperative planning, as they enable the visualization of intricate anatomical
structures. However, traditional CT-based approaches for creating bone models
are limited to preoperative use due to the low flexibility and high radiation
exposure of CT and time-consuming manual delineation. Here, we introduce
Semi-Supervised Reconstruction with Knowledge Distillation (SSR-KD), a fast and
accurate AI framework to reconstruct high-quality bone models from biplanar
X-rays in 30 seconds, with an average error under 1.0 mm, eliminating the
dependence on CT and manual work. Additionally, high tibial osteotomy
simulation was performed by experts on reconstructed bone models, demonstrating
that bone models reconstructed from biplanar X-rays have comparable clinical
applicability to those annotated from CT. Overall, our approach accelerates the
process, reduces radiation exposure, enables intraoperative guidance, and
significantly improves the practicality of bone models, offering transformative
applications in orthopedics.
☆ MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
Recently, multimodal large language models (MLLMs) have achieved significant
advancements across various domains, and corresponding evaluation benchmarks
have been continuously refined and improved. In this process, benchmarks in the
scientific domain have played an important role in assessing the reasoning
capabilities of MLLMs. However, existing benchmarks still face three key
challenges: 1) Insufficient evaluation of models' reasoning abilities in
multilingual scenarios; 2) Inadequate assessment of MLLMs' comprehensive
modality coverage; 3) Lack of fine-grained annotation of scientific knowledge
points. To address these gaps, we propose MME-SCI, a comprehensive and
challenging benchmark. We carefully collected 1,019 high-quality
question-answer pairs, which involve 3 distinct evaluation modes. These pairs
cover four subjects, namely mathematics, physics, chemistry, and biology, and
support five languages: Chinese, English, French, Spanish, and Japanese. We
conducted extensive experiments on 16 open-source models and 4 closed-source
models, and the results demonstrate that MME-SCI is widely challenging for
existing MLLMs. For instance, under the Image-only evaluation mode, o4-mini
achieved accuracies of only 52.11%, 24.73%, 36.57%, and 29.80% in mathematics,
physics, chemistry, and biology, respectively, indicating a significantly
higher difficulty level compared to existing benchmarks. More importantly,
using MME-SCI's multilingual and fine-grained knowledge attributes, we analyzed
existing models' performance in depth and identified their weaknesses in
specific domains. The Data and Evaluation Code are available at
https://github.com/JCruan519/MME-SCI.
comment: 9 pages, 6 figures, work in progress
☆ MMIS-Net for Retinal Fluid Segmentation and Detection
Purpose: Deep learning methods have shown promising results in the
segmentation and detection of diseases in medical images. However, most
methods are trained and tested on data from a single source, modality, organ,
or disease type, overlooking the combined potential of other available
annotated data. Numerous small annotated medical image datasets from various
modalities, organs, and diseases are publicly available. In this work, we aim
to leverage the synergistic potential of these datasets to improve performance
on unseen data. Approach: To this end, we propose a novel algorithm called
MMIS-Net (MultiModal Medical Image Segmentation Network), which features
Similarity Fusion blocks that utilize supervision and pixel-wise similarity
knowledge selection for feature map fusion. Additionally, to address
inconsistent class definitions and label contradictions, we created a one-hot
label space to handle classes absent in one dataset but annotated in another.
MMIS-Net was trained on 10 datasets encompassing 19 organs across 2 modalities
to build a single model. Results: The algorithm was evaluated on the RETOUCH
grand challenge hidden test set, outperforming large foundation models for
medical image segmentation and other state-of-the-art algorithms. We achieved
the best mean Dice score of 0.83 and an absolute volume difference of 0.035 for
the fluids segmentation task, as well as a perfect Area Under the Curve of 1
for the fluid detection task. Conclusion: The quantitative results highlight
the effectiveness of our proposed model due to the incorporation of Similarity
Fusion blocks into the network's backbone for supervision and similarity
knowledge selection, and the use of a one-hot label space to address label
class inconsistencies and contradictions.
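To illustrate how a shared one-hot label space can tolerate classes that a given
dataset never annotates, the hedged sketch below masks the per-class loss channels
that are absent from that dataset. The binary cross-entropy choice and the tensor
shapes are assumptions for illustration, not MMIS-Net's exact loss.

```python
# Hedged per-dataset class mask in a shared one-hot label space: channels for
# classes a dataset never annotates are excluded from the loss.
import torch
import torch.nn.functional as F

def masked_bce_loss(logits, targets, class_mask):
    """logits, targets: (B, C, H, W); class_mask: (C,) with 1 = annotated in this dataset."""
    loss = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    mask = class_mask.view(1, -1, 1, 1).to(loss.dtype)
    return (loss * mask).sum() / mask.expand_as(loss).sum().clamp(min=1.0)

logits = torch.randn(2, 5, 32, 32)
targets = (torch.rand(2, 5, 32, 32) > 0.5).float()
mask = torch.tensor([1, 1, 0, 1, 0])   # this dataset annotates only classes 0, 1, 3
print(masked_bce_loss(logits, targets, mask).item())
```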
☆ DIME-Net: A Dual-Illumination Adaptive Enhancement Network Based on Retinex and Mixture-of-Experts ACM MM 2025
Image degradation caused by complex lighting conditions such as low-light and
backlit scenarios is commonly encountered in real-world environments,
significantly affecting image quality and downstream vision tasks. Most
existing methods focus on a single type of illumination degradation and lack
the ability to handle diverse lighting conditions in a unified manner. To
address this issue, we propose a dual-illumination enhancement framework called
DIME-Net. The core of our method is a Mixture-of-Experts illumination estimator
module, where a sparse gating mechanism adaptively selects suitable S-curve
expert networks based on the illumination characteristics of the input image.
By integrating Retinex theory, this module effectively performs enhancement
tailored to both low-light and backlit images. To further correct
illumination-induced artifacts and color distortions, we design a damage
restoration module equipped with Illumination-Aware Cross Attention and
Sequential-State Global Attention mechanisms. In addition, we construct a
hybrid illumination dataset, MixBL, by integrating existing datasets, allowing
our model to achieve robust illumination adaptability through a single training
process. Experimental results show that DIME-Net achieves competitive
performance on both synthetic and real-world low-light and backlit datasets
without any retraining. These results demonstrate its generalization ability
and potential for practical multimedia applications under diverse and complex
illumination conditions.
comment: Accepted at ACM Multimedia 2025 (ACM MM 2025)
☆ PhysGM: Large Physical Gaussian Model for Feed-Forward 4D Synthesis
While physics-grounded 3D motion synthesis has seen significant progress,
current methods face critical limitations. They typically rely on
pre-reconstructed 3D Gaussian Splatting (3DGS) representations, while physics
integration depends on either inflexible, manually defined physical attributes
or unstable, optimization-heavy guidance from video models. To overcome these
challenges, we introduce PhysGM, a feed-forward framework that jointly predicts
a 3D Gaussian representation and its physical properties from a single image,
enabling immediate physical simulation and high-fidelity 4D rendering. We
first establish a base model by jointly optimizing for Gaussian reconstruction
and probabilistic physics prediction. The model is then refined with physically
plausible reference videos to enhance both rendering fidelity and physics
prediction accuracy. We adopt the Direct Preference Optimization (DPO) to align
its simulations with reference videos, circumventing Score Distillation
Sampling (SDS) optimization which needs back-propagating gradients through the
complex differentiable simulation and rasterization. To facilitate the
training, we introduce a new dataset PhysAssets of over 24,000 3D assets,
annotated with physical properties and corresponding guiding videos.
Experimental results demonstrate that our method effectively generates
high-fidelity 4D simulations from a single image in one minute. This represents
a significant speedup over prior works while delivering realistic rendering
results. Our project page is at: https://hihixiaolv.github.io/PhysGM.github.io/
☆ Learning to See Through Flare ICCV
Machine vision systems are susceptible to laser flare, where unwanted intense
laser illumination blinds and distorts their perception of the environment
through oversaturation or permanent damage to sensor pixels. We introduce
NeuSee, the first computational imaging framework for high-fidelity sensor
protection across the full visible spectrum. It jointly learns a neural
representation of a diffractive optical element (DOE) and a frequency-space
Mamba-GAN network for image restoration. The NeuSee system is adversarially trained
end-to-end on 100K unique images to suppress the peak laser irradiance as high
as $10^6$ times the sensor saturation threshold $I_{\textrm{sat}}$, the point
at which camera sensors may experience damage without the DOE. Our system
leverages heterogeneous data and model parallelism for distributed computing,
integrating hyperspectral information and multiple neural networks for
realistic simulation and image restoration. NeuSee takes into account
open-world scenes with dynamically varying laser wavelengths, intensities, and
positions, as well as lens flare effects, unknown ambient lighting conditions,
and sensor noise. It outperforms other learned DOEs, achieving full-spectrum
imaging and laser suppression for the first time, with a 10.1% improvement in
restored image quality.
comment: accepted by ICCVW 2025
☆ Multimodal Data Storage and Retrieval for Embodied AI: A Survey
Embodied AI (EAI) agents continuously interact with the physical world,
generating vast, heterogeneous multimodal data streams that traditional
management systems are ill-equipped to handle. In this survey, we first
systematically evaluate five storage architectures (Graph Databases,
Multi-Model Databases, Data Lakes, Vector Databases, and Time-Series
Databases), focusing on their suitability for addressing EAI's core
requirements, including physical grounding, low-latency access, and dynamic
scalability. We then analyze five retrieval paradigms (Fusion Strategy-Based
Retrieval, Representation Alignment-Based Retrieval, Graph-Structure-Based
Retrieval, Generation Model-Based Retrieval, and Efficient Retrieval-Based
Optimization), revealing a fundamental tension between achieving long-term
semantic coherence and maintaining real-time responsiveness. Based on this
comprehensive analysis, we identify key bottlenecks, spanning from the
foundational Physical Grounding Gap to systemic challenges in cross-modal
integration, dynamic adaptation, and open-world generalization. Finally, we
outline a forward-looking research agenda encompassing physics-aware data
models, adaptive storage-retrieval co-optimization, and standardized
benchmarking, to guide future research toward principled data management
solutions for EAI. Our survey is based on a comprehensive review of more than
180 related studies, providing a rigorous roadmap for designing the robust,
high-performance data management frameworks essential for the next generation
of autonomous embodied systems.
☆ SCRNet: Spatial-Channel Regulation Network for Medical Ultrasound Image Segmentation
Medical ultrasound image segmentation presents a formidable challenge in the
realm of computer vision. Traditional approaches rely on Convolutional Neural
Networks (CNNs) and Transformer-based methods to address the intricacies of
medical image segmentation. Nevertheless, inherent limitations persist, as
CNN-based methods tend to disregard long-range dependencies, while
Transformer-based methods may overlook local contextual information. To address
these deficiencies, we propose a novel Feature Aggregation Module (FAM)
designed to process two input features from the preceding layer. These features
are directed into the two branches of the Convolution and Cross-Attention
Parallel Module (CCAPM), which assign them different roles and thereby
establish a strong connection between the two
input features. This strategy enables our module to focus concurrently on both
long-range dependencies and local contextual information by judiciously merging
convolution operations with cross-attention mechanisms. Moreover, by
integrating FAM within our proposed Spatial-Channel Regulation Module (SCRM),
the ability to discern salient regions and informative features warranting
increased attention is enhanced. Furthermore, by incorporating the SCRM into
the encoder block of the UNet architecture, we introduce a novel framework
dubbed Spatial-Channel Regulation Network (SCRNet). The results of our
extensive experiments demonstrate the superiority of SCRNet, which consistently
achieves state-of-the-art (SOTA) performance compared to existing methods.
comment: 8 pages
☆ Forecasting Smog Events Using ConvLSTM: A Spatio-Temporal Approach for Aerosol Index Prediction in South Asia
The South Asian Smog refers to the recurring annual air pollution events
marked by high contaminant levels, reduced visibility, and significant
socio-economic impacts, primarily affecting the Indo-Gangetic Plains (IGP) from
November to February. Over the past decade, increased air pollution sources
such as crop residue burning, motor vehicles, and changing weather patterns
have intensified these smog events. However, real-time forecasting systems for
increased particulate matter concentrations are still not established at the
regional scale. The Aerosol Index, closely tied to smog formation and a key
component in calculating the Air Quality Index (AQI), reflects particulate
matter concentrations. This study forecasts aerosol events using Sentinel-5P
air constituent data (2019-2023) and a Convolutional Long-Short Term Memory
(ConvLSTM) neural network, which captures spatial and temporal correlations
more effectively than previous models. Using the Ultraviolet (UV) Aerosol Index
at 340-380 nm as the predictor, results show the Aerosol Index can be
forecasted at five-day intervals with a Mean Squared Error of ~0.0018, loss of
~0.3995, and Structural Similarity Index of ~0.74. While effective, the model
can be improved by integrating additional data and refining its architecture.
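A minimal Keras sketch of a ConvLSTM forecaster of the kind described above is
shown below; the grid size, layer widths, and five-frame history are illustrative
assumptions rather than the study's configuration, and the MSE loss simply mirrors
the reported error metric.

```python
# Hedged ConvLSTM forecaster sketch (illustrative shapes, not the study's model):
# input is a short history of gridded aerosol-index frames, output is the next frame.
import tensorflow as tf

def build_convlstm_forecaster(timesteps=5, height=64, width=64):
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(timesteps, height, width, 1)),
        tf.keras.layers.ConvLSTM2D(32, kernel_size=3, padding="same",
                                   return_sequences=True),
        tf.keras.layers.ConvLSTM2D(16, kernel_size=3, padding="same",
                                   return_sequences=False),
        # Map the last hidden state to a single-channel aerosol-index map.
        tf.keras.layers.Conv2D(1, kernel_size=3, padding="same"),
    ])
    model.compile(optimizer="adam", loss="mse")   # MSE mirrors the reported error metric
    return model

model = build_convlstm_forecaster()
model.summary()
```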
☆ In-hoc Concept Representations to Regularise Deep Learning in Medical Imaging ICCV
Valentina Corbetta, Floris Six Dijkstra, Regina Beets-Tan, Hoel Kervadec, Kristoffer Wickstrøm, Wilson Silva
Deep learning models in medical imaging often achieve strong in-distribution
performance but struggle to generalise under distribution shifts, frequently
relying on spurious correlations instead of clinically meaningful features. We
introduce LCRReg, a novel regularisation approach that leverages Latent Concept
Representations (LCRs) (e.g., Concept Activation Vectors (CAVs)) to guide
models toward semantically grounded representations. LCRReg requires no concept
labels in the main training set and instead uses a small auxiliary dataset to
synthesise high-quality, disentangled concept examples. We extract LCRs for
predefined relevant features, and incorporate a regularisation term that guides
a Convolutional Neural Network (CNN) to activate within latent subspaces
associated with those concepts. We evaluate LCRReg across synthetic and
real-world medical tasks. On a controlled toy dataset, it significantly
improves robustness to injected spurious correlations and remains effective
even in multi-concept and multiclass settings. On the diabetic retinopathy
binary classification task, LCRReg enhances performance under both synthetic
spurious perturbations and out-of-distribution (OOD) generalisation. Compared
to baselines, including multitask learning, linear probing, and post-hoc
concept-based models, LCRReg offers a lightweight, architecture-agnostic
strategy for improving model robustness without requiring dense concept
supervision. Code is available at the following link:
https://github.com/Trustworthy-AI-UU-NKI/lcr_regularization
comment: 13 pages, 13 figures, 2 tables, accepted at PHAROS-AFE-AIMI Workshop
in conjunction with the International Conference on Computer Vision (ICCV),
2025. This is the submitted manuscript with added link to the github repo,
funding acknowledgments and author names and affiliations, and a correction
to numbers in Table 1. Final version not published yet
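A hedged sketch of a concept-alignment penalty in the spirit of LCRReg follows:
latent features are encouraged to have a large component along at least one
concept activation vector. The squared-cosine form and the single-best-concept
reduction are assumptions for illustration, not the authors' formulation.

```python
# Hedged concept-alignment penalty: small when each feature lies close to the
# direction of at least one concept activation vector (CAV).
import torch
import torch.nn.functional as F

def lcr_penalty(features, cavs):
    """features: (B, D) latent activations; cavs: (K, D) concept activation vectors."""
    feats = F.normalize(features, dim=1)
    cavs = F.normalize(cavs, dim=1)
    cos = feats @ cavs.t()                              # (B, K) cosine similarities
    return 1.0 - cos.pow(2).max(dim=1).values.mean()    # rewards alignment with some concept

features = torch.randn(8, 128)
cavs = torch.randn(3, 128)                              # e.g. three predefined concepts
print(lcr_penalty(features, cavs).item())               # added to the task loss with a small weight
```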
☆ RICO: Two Realistic Benchmarks and an In-Depth Analysis for Incremental Learning in Object Detection ICCV
Incremental Learning (IL) trains models sequentially on new data without full
retraining, offering privacy, efficiency, and scalability. IL must balance
adaptability to new data with retention of old knowledge. However, evaluations
often rely on synthetic, simplified benchmarks, obscuring real-world IL
performance. To address this, we introduce two Realistic Incremental Object
Detection Benchmarks (RICO): Domain RICO (D-RICO) features domain shifts with a
fixed class set, and Expanding-Classes RICO (EC-RICO) integrates new domains
and classes per IL step. Built from 14 diverse datasets covering real and
synthetic domains, varying conditions (e.g., weather, time of day), camera
sensors, perspectives, and labeling policies, both benchmarks capture
challenges absent in existing evaluations. Our experiments show that all IL
methods underperform in adaptability and retention, while replaying a small
amount of previous data already outperforms all methods. However, individual
training on the data remains superior. We heuristically attribute this gap to
weak teachers in distillation, single models' inability to manage diverse
tasks, and insufficient plasticity. Our code will be made publicly available.
comment: Accepted to ICCV Workshops 2025
☆ A Novel Attention-Augmented Wavelet YOLO System for Real-time Brain Vessel Segmentation on Transcranial Color-coded Doppler
Wenxuan Zhang, Shuai Li, Xinyi Wang, Yu Sun, Hongyu Kang, Pui Yuk Chryste Wan, Yong-Ping Zheng, Sai-Kit Lam
The Circle of Willis (CoW), vital for ensuring consistent blood flow to the
brain, is closely linked to ischemic stroke. Accurate assessment of the CoW is
important for identifying individuals at risk and guiding appropriate clinical
management. Among existing imaging methods, Transcranial Color-coded Doppler
(TCCD) offers unique advantages due to its radiation-free nature,
affordability, and accessibility. However, reliable TCCD assessments depend
heavily on operator expertise for identifying anatomical landmarks and
performing accurate angle correction, which limits its widespread adoption. To
address this challenge, we propose an AI-powered, real-time CoW
auto-segmentation system capable of efficiently capturing cerebral arteries. No
prior studies have explored AI-driven cerebrovascular segmentation using TCCD.
In this work, we introduce a novel Attention-Augmented Wavelet YOLO (AAW-YOLO)
network tailored for TCCD data, designed to provide real-time guidance for
brain vessel segmentation in the CoW. We prospectively collected TCCD data
comprising 738 annotated frames and 3,419 labeled artery instances to establish
a high-quality dataset for model training and evaluation. The proposed AAW-YOLO
demonstrated strong performance in segmenting both ipsilateral and
contralateral CoW vessels, achieving an average Dice score of 0.901, IoU of
0.823, precision of 0.882, recall of 0.926, and mAP of 0.953, with a per-frame
inference speed of 14.199 ms. This system offers a practical solution to reduce
reliance on operator experience in TCCD-based cerebrovascular screening, with
potential applications in routine clinical workflows and resource-constrained
settings. Future research will explore bilateral modeling and larger-scale
validation.
☆ A Comprehensive Re-Evaluation of Biometric Modality Properties in the Modern Era
The rapid advancement of authentication systems and their increasing reliance
on biometrics for faster and more accurate user verification experience,
highlight the critical need for a reliable framework to evaluate the
suitability of biometric modalities for specific applications. Currently, the
most widely known evaluation framework is a comparative table from 1998, which
no longer adequately captures recent technological developments or emerging
vulnerabilities in biometric systems. To address these challenges, this work
revisits the evaluation of biometric modalities through an expert survey
involving 24 biometric specialists. The findings indicate substantial shifts in
property ratings across modalities. For example, face recognition shows
improved ratings due to technological progress, while fingerprint recognition
shows decreased reliability because of emerging vulnerabilities and attacks. Further
analysis of expert agreement levels across rated properties highlighted the
consistency of the provided evaluations and ensured the reliability of the
ratings. Finally, expert assessments are compared with dataset-level
uncertainty across 55 biometric datasets, revealing strong alignment in most
modalities and underscoring the importance of integrating empirical evidence
with expert insight. Moreover, the identified expert disagreements reveal key
open challenges and help guide future research toward resolving them.
☆ RED.AI Id-Pattern: First Results of Stone Deterioration Patterns with Multi-Agent Systems
The Id-Pattern system within the RED.AI project (Reabilitação Estrutural
Digital através da AI) is an agentic system designed to
assist in the identification of stone deterioration patterns. Traditional
methodologies, based on direct observation by expert teams, are accurate but
costly in terms of time and resources. The system developed here introduces and
evaluates a multi-agent artificial intelligence (AI) system, designed to
simulate collaboration between experts and automate the diagnosis of stone
pathologies from visual evidence. The approach is based on a cognitive
architecture that orchestrates a team of specialized AI agents which, in this
specific case, are limited to five: a lithologist, a pathologist, an
environmental expert, a conservator-restorer, and a diagnostic coordinator. To
evaluate the system we selected 28 difficult images involving multiple
deterioration patterns. Our first results show a substantial improvement across
all metrics of our system compared to the foundation model.
comment: 11 pages, 1 figure, 1 table. Contribution for REEACH 2025 Symposium
☆ SAGA: Learning Signal-Aligned Distributions for Improved Text-to-Image Generation
State-of-the-art text-to-image models produce visually impressive results but
often struggle with precise alignment to text prompts, leading to missing
critical elements or unintended blending of distinct concepts. We propose a
novel approach that learns a high-success-rate distribution conditioned on a
target prompt, ensuring that generated images faithfully reflect the
corresponding prompts. Our method explicitly models the signal component during
the denoising process, offering fine-grained control that mitigates
over-optimization and out-of-distribution artifacts. Moreover, our framework is
training-free and seamlessly integrates with both existing diffusion and flow
matching architectures. It also supports additional conditioning modalities --
such as bounding boxes -- for enhanced spatial alignment. Extensive experiments
demonstrate that our approach outperforms current state-of-the-art methods. The
code is available at https://github.com/grimalPaul/gsn-factory.
☆ Latent Interpolation Learning Using Diffusion Models for Cardiac Volume Reconstruction
Niklas Bubeck, Suprosanna Shit, Chen Chen, Can Zhao, Pengfei Guo, Dong Yang, Georg Zitzlsberger, Daguang Xu, Bernhard Kainz, Daniel Rueckert, Jiazhen Pan
Cardiac Magnetic Resonance (CMR) imaging is a critical tool for diagnosing
and managing cardiovascular disease, yet its utility is often limited by the
sparse acquisition of 2D short-axis slices, resulting in incomplete volumetric
information. Accurate 3D reconstruction from these sparse slices is essential
for comprehensive cardiac assessment, but existing methods face challenges,
including reliance on predefined interpolation schemes (e.g., linear or
spherical), computational inefficiency, and dependence on additional semantic
inputs such as segmentation labels or motion data. To address these
limitations, we propose a novel Cardiac Latent Interpolation Diffusion (CaLID)
framework that introduces
three key innovations. First, we present a data-driven interpolation scheme
based on diffusion models, which can capture complex, non-linear relationships
between sparse slices and improves reconstruction accuracy. Second, we design a
computationally efficient method that operates in the latent space and speeds
up 3D whole-heart upsampling time by a factor of 24, reducing computational
overhead compared to previous methods. Third, with only sparse 2D CMR images as
input, our method achieves SOTA performance against baseline methods,
eliminating the need for auxiliary input such as morphological guidance, thus
simplifying workflows. We further extend our method to 2D+T data, enabling the
effective modeling of spatiotemporal dynamics and ensuring temporal coherence.
Extensive volumetric evaluations and downstream segmentation tasks demonstrate
that CaLID achieves superior reconstruction quality and efficiency. By
addressing the fundamental limitations of existing approaches, our framework
advances the state of the art for spatial and spatiotemporal whole-heart
reconstruction, offering a robust and clinically practical solution for
cardiovascular imaging.
☆ Self-Aware Adaptive Alignment: Enabling Accurate Perception for Intelligent Transportation Systems
Achieving strong detection performance in intelligent transportation systems is
a critical research area. However, many challenges still need to be addressed
when detection is performed in a cross-domain scenario. In this paper, we
propose Self-Aware Adaptive Alignment (SA3), which leverages an efficient
alignment mechanism and recognition strategy. Our proposed method employs a
specified attention-based alignment module trained on source and target domain
datasets to guide the image-level features alignment process, enabling the
local-global adaptive alignment between the source domain and target domain.
Features from both domains, whose channel importance is re-weighted, are fed
into the region proposal network, which facilitates the acquisition of salient
region features. Also, we introduce an instance-to-image level alignment module
specific to the target domain to adaptively mitigate the domain gap. To
evaluate the proposed method, extensive experiments have been conducted on
popular cross-domain object detection benchmarks. Experimental results show
that SA3 achieves superior results to the previous state-of-the-art methods.
comment: Domain adaptation, Virtual Reality, Object Detection
☆ Unsupervised Urban Tree Biodiversity Mapping from Street-Level Imagery Using Spatially-Aware Visual Clustering
Diaa Addeen Abuhani, Marco Seccaroni, Martina Mazzarello, Imran Zualkernan, Fabio Duarte, Carlo Ratti
Urban tree biodiversity is critical for climate resilience, ecological
stability, and livability in cities, yet most municipalities lack detailed
knowledge of their canopies. Field-based inventories provide reliable estimates
of Shannon and Simpson diversity but are costly and time-consuming, while
supervised AI methods require labeled data that often fail to generalize across
regions. We introduce an unsupervised clustering framework that integrates
visual embeddings from street-level imagery with spatial planting patterns to
estimate biodiversity without labels. Applied to eight North American cities,
the method recovers genus-level diversity patterns with high fidelity,
achieving low Wasserstein distances to ground truth for Shannon and Simpson
indices and preserving spatial autocorrelation. This scalable, fine-grained
approach enables biodiversity mapping in cities lacking detailed inventories
and offers a pathway for continuous, low-cost monitoring to support equitable
access to greenery and adaptive management of urban ecosystems.
comment: 26 pages, 7 figures, Nature Format
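For reference, the Shannon and Simpson indices reported above can be computed from
per-cluster (genus-proxy) counts with the standard definitions below; the example
counts are made up for illustration.

```python
# Standard diversity indices computed from per-cluster tree counts (genus proxies).
import numpy as np

def shannon_simpson(counts):
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    p = p[p > 0]                          # drop empty clusters to keep the log well-defined
    shannon = -np.sum(p * np.log(p))      # H = -sum_i p_i ln p_i
    simpson = 1.0 - np.sum(p ** 2)        # D = 1 - sum_i p_i^2
    return shannon, simpson

print(shannon_simpson([120, 80, 40, 10]))  # made-up counts for four visual clusters
```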
☆ Timestep-Compressed Attack on Spiking Neural Networks through Timestep-Level Backpropagation
State-of-the-art (SOTA) gradient-based adversarial attacks on spiking neural
networks (SNNs), which largely rely on extending FGSM and PGD frameworks, face
a critical limitation: substantial attack latency from multi-timestep
processing, rendering them infeasible for practical real-time applications.
This inefficiency stems from their design as direct extensions of ANN
paradigms, which fail to exploit key SNN properties. In this paper, we propose
the timestep-compressed attack (TCA), a novel framework that significantly
reduces attack latency. TCA introduces two components founded on key insights
into SNN behavior. First, timestep-level backpropagation (TLBP) is based on our
finding that global temporal information in backpropagation to generate
perturbations is not critical for an attack's success, enabling per-timestep
evaluation for early stopping. Second, adversarial membrane potential reuse
(A-MPR) is motivated by the observation that initial timesteps are
inefficiently spent accumulating membrane potential, a warm-up phase that can
be pre-calculated and reused. Our experiments on VGG-11 and ResNet-17 with the
CIFAR-10/100 and CIFAR10-DVS datasets show that TCA significantly reduces the
required attack latency by up to 56.6% and 57.1% compared to SOTA methods in
white-box and black-box settings, respectively, while maintaining a comparable
attack success rate.
comment: 8 pages
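The per-timestep early-stopping idea in TCA can be sketched as follows: logits are
accumulated timestep by timestep and the loop exits as soon as the running
prediction flips, instead of always simulating all timesteps. The toy per-timestep
model is a placeholder rather than an SNN, and the sketch omits the perturbation
update and the membrane-potential reuse component.

```python
# Hedged early-stopping sketch for a timestep-unrolled classifier under attack.
import torch

def early_stop_forward(model_step, x_adv, label, timesteps=8):
    """model_step(x, t) -> per-timestep logits; returns (success, accumulated logits)."""
    acc = None
    for t in range(timesteps):
        out = model_step(x_adv, t)
        acc = out if acc is None else acc + out
        if acc.argmax(dim=-1).item() != label:   # attack already succeeds: stop early
            return True, acc
    return False, acc

# Toy stand-in for a timestep-unrolled model (assumption for illustration).
toy = lambda x, t: torch.randn(1, 10)
print(early_stop_forward(toy, torch.zeros(1, 3, 32, 32), label=3))
```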
☆ Is-NeRF: In-scattering Neural Radiance Field for Blurred Images
Neural Radiance Fields (NeRF) has gained significant attention for its
prominent implicit 3D representation and realistic novel view synthesis
capabilities. Existing works invariably employ straight-line volume
rendering, which struggles to handle sophisticated lightpath scenarios and
introduces geometric ambiguities during training that are particularly evident
when processing motion-blurred images. To address these challenges, this work
proposes a novel deblur neural radiance field, Is-NeRF, featuring explicit
lightpath modeling in real-world environments. By unifying six common light
propagation phenomena through an in-scattering representation, we establish a
new scattering-aware volume rendering pipeline adaptable to complex lightpaths.
Additionally, we introduce an adaptive learning strategy that enables
autonomous determination of scattering directions and sampling intervals to
capture finer object details. The proposed network jointly optimizes NeRF
parameters, scattering parameters, and camera motions to recover fine-grained
scene representations from blurry images. Comprehensive evaluations demonstrate
that it effectively handles complex real-world scenarios, outperforming
state-of-the-art approaches in generating high-fidelity images with accurate
geometric details.
☆ Sketch3DVE: Sketch-based 3D-Aware Scene Video Editing SIGGRAPH 2025
Recent video editing methods achieve attractive results in style transfer or
appearance modification. However, editing the structural content of 3D scenes
in videos remains challenging, particularly when dealing with significant
viewpoint changes, such as large camera rotations or zooms. Key challenges
include generating novel view content that remains consistent with the original
video, preserving unedited regions, and translating sparse 2D inputs into
realistic 3D video outputs. To address these issues, we propose Sketch3DVE, a
sketch-based 3D-aware video editing method to enable detailed local
manipulation of videos with significant viewpoint changes. To solve the
challenge posed by sparse inputs, we employ image editing methods to generate
edited results for the first frame, which are then propagated to the remaining
frames of the video. We utilize sketching as an interaction tool for precise
geometry control, while other mask-based image editing methods are also
supported. To handle viewpoint changes, we perform a detailed analysis and
manipulation of the 3D information in the video. Specifically, we utilize a
dense stereo method to estimate a point cloud and the camera parameters of the
input video. We then propose a point cloud editing approach that uses depth
maps to represent the 3D geometry of newly edited components, aligning them
effectively with the original 3D scene. To seamlessly merge the newly edited
content with the original video while preserving the features of unedited
regions, we introduce a 3D-aware mask propagation strategy and employ a video
diffusion model to produce realistic edited videos. Extensive experiments
demonstrate the superiority of Sketch3DVE in video editing. Homepage and code:
http://geometrylearning.com/Sketch3DVE/
comment: SIGGRAPH 2025
☆ A Fully Transformer Based Multimodal Framework for Explainable Cancer Image Segmentation Using Radiology Reports
We introduce Med-CTX, a fully transformer based multimodal framework for
explainable breast cancer ultrasound segmentation. We integrate clinical
radiology reports to boost both performance and interpretability. Med-CTX
achieves exact lesion delineation by using a dual-branch visual encoder that
combines ViT and Swin transformers, as well as uncertainty aware fusion.
Clinical language structured with BI-RADS semantics is encoded by
BioClinicalBERT and combined with visual features utilising cross-modal
attention, allowing the model to provide clinically grounded, model generated
explanations. Our methodology generates segmentation masks, uncertainty maps,
and diagnostic rationales all at once, increasing confidence and transparency
in computer assisted diagnosis. On the BUS-BRA dataset, Med-CTX achieves a Dice
score of 99% and an IoU of 95%, beating existing baselines U-Net, ViT, and
Swin. Clinical text plays a key role in segmentation accuracy and explanation
quality, as evidenced by ablation studies showing a 5.4% drop in Dice score and
a 31% drop in CIDEr when it is removed. Med-CTX achieves strong multimodal
alignment (CLIP score: 85%) and improved confidence calibration (ECE: 3.2%),
setting a new bar for trustworthy multimodal medical architectures.
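As a rough illustration of the cross-modal attention fusion described above, the sketch below lets visual tokens (e.g., concatenated ViT and Swin patch tokens) attend over clinical-text embeddings (e.g., BioClinicalBERT hidden states). The dimensions, the linear projection, and the residual LayerNorm fusion are illustrative assumptions, not Med-CTX's exact architecture.
```python
# Minimal sketch of cross-modal attention fusion between visual tokens and
# clinical-text embeddings. Dimensions and the residual fusion are assumptions.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, vis_dim=768, txt_dim=768, n_heads=8):
        super().__init__()
        self.txt_proj = nn.Linear(txt_dim, vis_dim)      # align text width to visual width
        self.attn = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(vis_dim)

    def forward(self, vis_tokens, txt_tokens):
        # vis_tokens: (B, Nv, vis_dim), e.g. concatenated ViT + Swin patch tokens
        # txt_tokens: (B, Nt, txt_dim), e.g. BioClinicalBERT hidden states
        txt = self.txt_proj(txt_tokens)
        fused, _ = self.attn(query=vis_tokens, key=txt, value=txt)
        return self.norm(vis_tokens + fused)             # residual fusion

# Toy usage with random tensors standing in for encoder outputs.
fusion = CrossModalFusion()
vis = torch.randn(2, 196, 768)
txt = torch.randn(2, 64, 768)
print(fusion(vis, txt).shape)  # torch.Size([2, 196, 768])
```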
☆ VisionLaw: Inferring Interpretable Intrinsic Dynamics from Visual Observations via Bilevel Optimization
The intrinsic dynamics of an object governs its physical behavior in the real
world, playing a critical role in enabling physically plausible interactive
simulation with 3D assets. Existing methods have attempted to infer the
intrinsic dynamics of objects from visual observations, but generally face two
major challenges: one line of work relies on manually defined constitutive
priors, making it difficult to generalize to complex scenarios; the other
models intrinsic dynamics using neural networks, resulting in limited
interpretability and poor generalization. To address these challenges, we
propose VisionLaw, a bilevel optimization framework that infers interpretable
expressions of intrinsic dynamics from visual observations. At the upper level,
we introduce an LLMs-driven decoupled constitutive evolution strategy, where
LLMs are prompted as a knowledgeable physics expert to generate and revise
constitutive laws, with a built-in decoupling mechanism that substantially
reduces the search complexity of LLMs. At the lower level, we introduce a
vision-guided constitutive evaluation mechanism, which utilizes visual
simulation to evaluate the consistency between the generated constitutive law
and the underlying intrinsic dynamics, thereby guiding the upper-level
evolution. Experiments on both synthetic and real-world datasets demonstrate
that VisionLaw can effectively infer interpretable intrinsic dynamics from
visual observations. It significantly outperforms existing state-of-the-art
methods and exhibits strong generalization for interactive simulation in novel
scenarios.
comment: 9 pages, 6 figures
☆ Shape-from-Template with Generalised Camera
This article presents a new method for non-rigidly registering a 3D shape to
2D keypoints observed by a constellation of multiple cameras. Non-rigid
registration of a 3D shape to observed 2D keypoints, i.e., Shape-from-Template
(SfT), has been widely studied using single images, but SfT that jointly uses
information from multiple cameras opens new directions for extending the scope of
known use-cases such as 3D shape registration in medical imaging and
registration from hand-held cameras, to name a few. We represent such
multi-camera setup with the generalised camera model; therefore any collection
of perspective or orthographic cameras observing any deforming object can be
registered. We propose multiple approaches for such SfT: the first approach
where the corresponded keypoints lie on a direction vector from a known 3D
point in space, the second approach where the corresponded keypoints lie on a
direction vector from an unknown 3D point in space but with known orientation
w.r.t. some local reference frame, and a third approach where, apart from
correspondences, the silhouette of the imaged object is also known. Together,
these form the first set of solutions to the SfT problem with generalised
cameras. The key idea behind SfT with generalised camera is the improved
reconstruction accuracy from estimating deformed shape while utilising the
additional information from the mutual constraints between multiple views of a
deformed object. The correspondence-based approaches are solved with convex
programming while the silhouette-based approach is an iterative refinement of
the results from the convex solutions. We demonstrate the accuracy of our
proposed methods on a variety of synthetic and real datasets.
comment: Pre-print of the IMAVIS article:
https://www.sciencedirect.com/science/article/abs/pii/S0262885625001672 Code
and data in: https://git.zib.de/asengupta/sft-generalised
☆ Comparing Conditional Diffusion Models for Synthesizing Contrast-Enhanced Breast MRI from Pre-Contrast Images MICCAI
Sebastian Ibarra, Javier del Riego, Alessandro Catanese, Julian Cuba, Julian Cardona, Nataly Leon, Jonathan Infante, Karim Lekadir, Oliver Diaz, Richard Osuala
Dynamic contrast-enhanced (DCE) MRI is essential for breast cancer diagnosis
and treatment. However, its reliance on contrast agents introduces safety
concerns, contraindications, increased cost, and workflow complexity. To this
end, we present pre-contrast conditioned denoising diffusion probabilistic
models to synthesize DCE-MRI, introducing, evaluating, and comparing a total of
22 generative model variants in both single-breast and full breast settings.
Towards enhancing lesion fidelity, we introduce both tumor-aware loss functions
and explicit tumor segmentation mask conditioning. Using a public multicenter
dataset and comparing to respective pre-contrast baselines, we observe that
subtraction image-based models consistently outperform post-contrast-based
models across five complementary evaluation metrics. Apart from assessing the
entire image, we also separately evaluate the region of interest, where both
tumor-aware losses and segmentation mask inputs improve evaluation metrics. The
latter notably enhance qualitative results capturing contrast uptake, albeit
assuming access to tumor localization inputs that are not guaranteed to be
available in screening settings. A reader study involving 2 radiologists and 4
MRI technologists confirms the high realism of the synthetic images, indicating
an emerging clinical potential of generative contrast-enhancement. We share our
codebase at https://github.com/sebastibar/conditional-diffusion-breast-MRI.
comment: 13 pages, 5 figures, submitted and accepted to MICCAI Deepbreath
workshop 2025
☆ MR6D: Benchmarking 6D Pose Estimation for Mobile Robots CVPR 2025
Anas Gouda, Shrutarv Awasthi, Christian Blesing, Lokeshwaran Manohar, Frank Hoffmann, Alice Kirchheim
Existing 6D pose estimation datasets primarily focus on small household
objects typically handled by robot arm manipulators, limiting their relevance
to mobile robotics. Mobile platforms often operate without manipulators,
interact with larger objects, and face challenges such as long-range
perception, heavy self-occlusion, and diverse camera perspectives. While recent
models generalize well to unseen objects, evaluations remain confined to
household-like settings that overlook these factors. We introduce MR6D, a
dataset designed for 6D pose estimation for mobile robots in industrial
environments. It includes 92 real-world scenes featuring 16 unique objects
across static and dynamic interactions. MR6D captures the challenges specific
to mobile platforms, including distant viewpoints, varied object
configurations, larger object sizes, and complex occlusion/self-occlusion
patterns. Initial experiments reveal that current 6D pipelines underperform in
these settings, with 2D segmentation being another hurdle. MR6D establishes a
foundation for developing and evaluating pose estimation methods tailored to
the demands of mobile robotics. The dataset is available at
https://huggingface.co/datasets/anas-gouda/mr6d.
comment: accepted CVPR 2025 Workshop on Recovering 6D Object Pose (R6D)
☆ Deep Biomechanically-Guided Interpolation for Keypoint-Based Brain Shift Registration MICCAI 2025
Accurate compensation of brain shift is critical for maintaining the
reliability of neuronavigation during neurosurgery. While keypoint-based
registration methods offer robustness to large deformations and topological
changes, they typically rely on simple geometric interpolators that ignore
tissue biomechanics to create dense displacement fields. In this work, we
propose a novel deep learning framework that estimates dense, physically
plausible brain deformations from sparse matched keypoints. We first generate a
large dataset of synthetic brain deformations using biomechanical simulations.
Then, a residual 3D U-Net is trained to refine standard interpolation estimates
into biomechanically guided deformations. Experiments on a large set of
simulated displacement fields demonstrate that our method significantly
outperforms classical interpolators, halving the mean squared error while
introducing negligible computational overhead at inference time. Code
available at:
https://github.com/tiago-assis/Deep-Biomechanical-Interpolator.
comment: Accepted at COLlaborative Intelligence and Autonomy in Image-guided
Surgery (COLAS) Workshop - MICCAI 2025
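The residual-refinement idea above can be sketched as follows: a small 3D CNN (a stand-in for the paper's residual 3D U-Net) predicts a correction that is added to a classical interpolation of the sparse keypoint displacements. Shapes, channel counts, and the toy input are illustrative assumptions.
```python
# Minimal sketch of residual refinement of a dense displacement field:
# a tiny 3D CNN predicts a correction added to a geometric interpolation.
import torch
import torch.nn as nn

class ResidualRefiner(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, 3, 3, padding=1),
        )

    def forward(self, interp_field):
        # interp_field: (B, 3, D, H, W) displacement from a classical interpolator
        return interp_field + self.net(interp_field)   # residual correction

refiner = ResidualRefiner()
interp = torch.randn(1, 3, 32, 32, 32)                 # e.g. thin-plate-spline output
print(refiner(interp).shape)                           # torch.Size([1, 3, 32, 32, 32])
```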
☆ Mitigating Cross-Image Information Leakage in LVLMs for Multi-Image Tasks
Large Vision-Language Models (LVLMs) demonstrate strong performance on
single-image tasks. However, we observe that their performance degrades
significantly when handling multi-image inputs. This occurs because visual cues
from different images become entangled in the model's output. We refer to this
phenomenon as cross-image information leakage. To address this issue, we
propose FOCUS, a training-free and architecture-agnostic decoding strategy that
mitigates cross-image information leakage during inference. FOCUS sequentially
masks all but one image with random noise, guiding the model to focus on the
single clean image. We repeat this process across all target images to obtain
logits under partially masked contexts. These logits are aggregated and then
contrastively refined using a noise-only reference input, which suppresses the
leakage and yields more accurate outputs. FOCUS consistently improves
performance across four multi-image benchmarks and diverse LVLM families. This
demonstrates that FOCUS offers a general and practical solution for enhancing
multi-image reasoning without additional training or architectural
modifications.
comment: Source code is available at https://github.com/yejipark-m/FOCUS
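The decoding rule described above can be sketched directly from the abstract: score each target image with the other images replaced by noise, aggregate the per-image logits, then contrast against a noise-only reference. The model interface, the mean aggregation, and the weight alpha below are assumptions; the toy model only makes the sketch runnable.
```python
# Sketch of a FOCUS-style decoding step for multi-image LVLM inputs.
import torch

def focus_logits(model, images, prompt, alpha=1.0):
    # model(images, prompt) -> next-token logits of shape (vocab,)
    per_image = []
    for i in range(len(images)):
        masked = [img if j == i else torch.randn_like(img)   # noise-mask the others
                  for j, img in enumerate(images)]
        per_image.append(model(masked, prompt))
    aggregated = torch.stack(per_image).mean(dim=0)
    noise_only = model([torch.randn_like(img) for img in images], prompt)
    return aggregated - alpha * noise_only                    # contrastive refinement

# Toy stand-in "model" so the sketch runs end to end.
def toy_model(images, prompt, vocab=32000):
    return torch.zeros(vocab) + sum(img.mean() for img in images)

imgs = [torch.rand(3, 224, 224) for _ in range(3)]
print(focus_logits(toy_model, imgs, "Which image shows a dog?").shape)
```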
☆ Enhancing Targeted Adversarial Attacks on Large Vision-Language Models through Intermediate Projector Guidance
Targeted adversarial attacks are essential for proactively identifying
security flaws in Vision-Language Models before real-world deployment. However,
current methods perturb images to maximize global similarity with the target
text or reference image at the encoder level, collapsing rich visual semantics
into a single global vector. This limits attack granularity, hindering
fine-grained manipulations such as modifying a car while preserving its
background. Furthermore, these methods largely overlook the projector module, a
critical semantic bridge between the visual encoder and the language model in
VLMs, thereby failing to disrupt the full vision-language alignment pipeline
within VLMs and limiting attack effectiveness. To address these issues, we
propose the Intermediate Projector Guided Attack (IPGA), the first method to
mount attacks through the intermediate stage of the projector module, specifically the
widely adopted Q-Former, which transforms global image embeddings into
fine-grained visual features. This enables more precise control over
adversarial perturbations by operating on semantically meaningful visual tokens
rather than a single global representation. Specifically, IPGA leverages the
Q-Former pretrained solely on the first vision-language alignment stage,
without LLM fine-tuning, which improves both attack effectiveness and
transferability across diverse VLMs. Furthermore, we propose Residual Query
Alignment (RQA) to preserve unrelated visual content, thereby yielding more
controlled and precise adversarial manipulations. Extensive experiments show
that our attack method consistently outperforms existing methods in both
standard global image captioning tasks and fine-grained visual
question-answering tasks in black-box environments. Additionally, IPGA
successfully transfers to multiple commercial VLMs, including Google Gemini and
OpenAI GPT.
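As a loose illustration of feature-level targeted attacks of this kind, the sketch below runs a generic PGD loop that pushes an image's intermediate (projector-level) features toward a target's features. The feature extractor, budget, and loss are placeholders, and the paper's Residual Query Alignment term is not reproduced here.
```python
# Generic PGD-style sketch of a feature-level targeted attack.
import torch

def targeted_feature_attack(feat_fn, image, target_feats, eps=8/255, alpha=2/255, steps=20):
    # feat_fn(x) -> intermediate visual tokens (e.g. projector outputs)
    adv = image.clone().detach()
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = torch.nn.functional.mse_loss(feat_fn(adv), target_feats)
        grad, = torch.autograd.grad(loss, adv)
        with torch.no_grad():
            adv = adv - alpha * grad.sign()                    # move features toward target
            adv = image + (adv - image).clamp(-eps, eps)       # project onto L_inf ball
            adv = adv.clamp(0, 1)
    return adv.detach()

# Toy stand-in feature extractor so the sketch runs.
proj = torch.nn.Linear(3 * 32 * 32, 64)
feat_fn = lambda x: proj(x.flatten()).reshape(8, 8)
img = torch.rand(3, 32, 32)
tgt = feat_fn(torch.rand(3, 32, 32)).detach()
adv = targeted_feature_attack(feat_fn, img, tgt)
print((adv - img).abs().max() <= 8/255 + 1e-6)
```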
☆ Hierarchical Vision-Language Retrieval of Educational Metaverse Content in Agriculture
Every day, a large amount of educational content is uploaded online across
different areas, including agriculture and gardening. When these videos or
materials are grouped meaningfully, they can make learning easier and more
effective. One promising way to organize and enrich such content is through the
Metaverse, which allows users to explore educational experiences in an
interactive and immersive environment. However, searching for relevant
Metaverse scenarios and finding those matching users' interests remains a
challenging task. A first step in this direction has been taken recently, but
existing datasets are small and not sufficient for training advanced models. In
this work, we make two main contributions: first, we introduce a new dataset
containing 457 agricultural-themed virtual museums (AgriMuseums), each enriched
with textual descriptions; and second, we propose a hierarchical
vision-language model to represent and retrieve relevant AgriMuseums using
natural language queries. In our experimental setting, the proposed method
achieves up to about 62\% R@1 and 78\% MRR, confirming its effectiveness, and
it also leads to improvements on existing benchmarks by up to 6\% R@1 and 11\%
MRR. Moreover, an extensive evaluation validates our design choices. Code and
dataset are available at
https://github.com/aliabdari/Agricultural_Metaverse_Retrieval .
comment: Accepted for publication at the 23rd International Conference on
Image Analysis and Processing (ICIAP 2025)
☆ Diversity-enhanced Collaborative Mamba for Semi-supervised Medical Image Segmentation
Acquiring high-quality annotated data for medical image segmentation is
tedious and costly. Semi-supervised segmentation techniques alleviate this
burden by leveraging unlabeled data to generate pseudo labels. Recently,
advanced state space models, represented by Mamba, have shown efficient
handling of long-range dependencies. This drives us to explore their potential
in semi-supervised medical image segmentation. In this paper, we propose a
novel Diversity-enhanced Collaborative Mamba framework (namely DCMamba) for
semi-supervised medical image segmentation, which explores and utilizes the
diversity from data, network, and feature perspectives. Firstly, from the data
perspective, we develop a patch-level weak-strong mixing augmentation tailored
to Mamba's scan-based modeling characteristics. Moreover, from the network
perspective, we introduce a diverse-scan collaboration module, which could
benefit from the prediction discrepancies arising from different scanning
directions. Furthermore, from the feature perspective, we adopt an
uncertainty-weighted contrastive learning mechanism to enhance the diversity of
feature representation. Experiments demonstrate that our DCMamba significantly
outperforms other semi-supervised medical image segmentation methods, e.g.,
surpassing the latest SSM-based method by 6.69% on the Synapse dataset with 20%
labeled data.
☆ subCellSAM: Zero-Shot (Sub-)Cellular Segmentation for Hit Validation in Drug Discovery
High-throughput screening using automated microscopes is a key driver in
biopharma drug discovery, enabling the parallel evaluation of thousands of drug
candidates for diseases such as cancer. Traditional image analysis and deep
learning approaches have been employed to analyze these complex, large-scale
datasets, with cell segmentation serving as a critical step for extracting
relevant structures. However, both strategies typically require extensive
manual parameter tuning or domain-specific model fine-tuning. We present a
novel method that applies a segmentation foundation model in a zero-shot
setting (i.e., without fine-tuning), guided by an in-context learning strategy.
Our approach employs a three-step process for nuclei, cell, and subcellular
segmentation, introducing a self-prompting mechanism that encodes morphological
and topological priors using growing masks and strategically placed
foreground/background points. We validate our method on both standard cell
segmentation benchmarks and industry-relevant hit validation assays,
demonstrating that it accurately segments biologically relevant structures
without the need for dataset-specific tuning.
comment: Accepted at DAGM German Conference on Pattern Recognition (GCPR) 2025
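A heavily simplified sketch of zero-shot point prompting with the Segment Anything predictor is shown below: a single foreground point yields an initial mask, and one background point outside that mask is added before re-prompting. The checkpoint path is a placeholder, and this loop is only a stand-in for the paper's growing-mask self-prompting scheme, not its exact procedure.
```python
# Hedged sketch of zero-shot point prompting with Segment Anything.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

def segment_with_self_prompt(image, seed_xy):
    predictor.set_image(image)                       # HxWx3 uint8 RGB image
    pts = np.array([seed_xy], dtype=np.float32)
    lbl = np.array([1])                              # 1 = foreground point
    masks, scores, _ = predictor.predict(point_coords=pts, point_labels=lbl,
                                         multimask_output=True)
    mask = masks[int(scores.argmax())]
    # Add one background point outside the current mask and re-prompt (simplified).
    ys, xs = np.where(~mask)
    if len(xs) > 0:
        bg = [int(xs[0]), int(ys[0])]
        pts = np.array([seed_xy, bg], dtype=np.float32)
        lbl = np.array([1, 0])
        masks, scores, _ = predictor.predict(point_coords=pts, point_labels=lbl,
                                             multimask_output=True)
        mask = masks[int(scores.argmax())]
    return mask
```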
☆ HumanPCR: Probing MLLM Capabilities in Diverse Human-Centric Scenes
Keliang Li, Hongze Shen, Hao Shi, Ruibing Hou, Hong Chang, Jie Huang, Chenghao Jia, Wen Wang, Yiling Wu, Dongmei Jiang, Shiguang Shan, Xilin Chen
The aspiration for artificial general intelligence, fueled by the rapid
progress of multimodal models, demands human-comparable performance across
diverse environments. We propose HumanPCR, an evaluation suite for probing
MLLMs' capacity for understanding human-related visual contexts across three hierarchical
levels: Perception, Comprehension, and Reasoning (denoted by Human-P, Human-C,
and Human-R, respectively). Human-P and Human-C feature over 6,000
human-verified multiple-choice questions, assessing a broad range of tasks across 9
dimensions, including but not limited to essential skills frequently overlooked
by existing benchmarks. Human-R offers a challenging manually curated video
reasoning test that requires integrating multiple visual evidences, proactively
extracting context beyond question cues, and applying human-like expertise.
Each question includes human-annotated Chain-of-Thought (CoT) rationales with
key visual evidence to support further research. Extensive evaluations on over
30 state-of-the-art models exhibit significant challenges in human-centric
visual understanding, particularly in tasks involving detailed space
perception, temporal understanding, and mind modeling. Moreover, analysis of
Human-R reveals the struggle of models in extracting essential proactive visual
evidence from diverse human scenes and their faulty reliance on query-guided
retrieval. Even advanced techniques such as scaling visual contexts and
test-time thinking yield only limited benefits. We hope HumanPCR and our
findings will advance the development, evaluation, and human-centric
application of multimodal models.
☆ DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction
The automated extraction of complete and precise road network graphs from
remote sensing imagery remains a critical challenge in geospatial computer
vision. Segmentation-based approaches, while effective in pixel-level
recognition, struggle to maintain topology fidelity after vectorization
postprocessing. Graph-growing methods build more topologically faithful graphs
but suffer from computationally prohibitive iterative ROI cropping.
Graph-generating methods first predict global static candidate road network
vertices, and then infer possible edges between vertices. They achieve fast
topology-aware inference, but limit the dynamic insertion of vertices. To
address these challenges, we propose DeH4R, a novel hybrid model that combines
graph-generating efficiency and graph-growing dynamics. This is achieved by
decoupling the task into candidate vertex detection, adjacent vertex
prediction, initial graph construction, and graph expansion. This architectural
innovation enables dynamic vertex (edge) insertions while retaining fast
inference speed and enhancing both topology fidelity and spatial consistency.
Comprehensive evaluations on CityScale and SpaceNet benchmarks demonstrate
state-of-the-art (SOTA) performance. DeH4R outperforms the prior SOTA
graph-growing method RNGDet++ by 4.62 APLS and 10.18 IoU on CityScale, while
being approximately 10 $\times$ faster. The code will be made publicly
available at https://github.com/7777777FAN/DeH4R.
comment: Under review
☆ Model-based Multi-object Visual Tracking: Identification and Standard Model Limitations
This paper uses multi-object tracking methods known from the radar tracking
community to address the problem of pedestrian tracking using 2D bounding box
detections. The standard point-object (SPO) model is adopted, and the posterior
density is computed using the Poisson multi-Bernoulli mixture (PMBM) filter.
The selection of the model parameters rooted in continuous time is discussed,
including the birth and survival probabilities. Some parameters are selected
from the first principles, while others are identified from the data, which is,
in this case, the publicly available MOT-17 dataset. Although the resulting
PMBM algorithm yields promising results, a mismatch between the SPO model and
the data is revealed. The model-based approach suggests that modifying the
problematic components causing the SPO model-data mismatch will lead to better
model-based algorithms in future developments.
comment: Submitted to FUSION 2025 conference
☆ OmniTry: Virtual Try-On Anything without Masks
Virtual Try-On (VTON) is a practical and widely applied task, for which most
existing works focus on clothes. This paper presents OmniTry, a unified
framework that extends VTON beyond garments to encompass any wearable object,
e.g., jewelry and accessories, in a mask-free setting for more practical
application. When extending to various types of objects, data curation is
challenging for obtaining paired images, i.e., the object image and the
corresponding try-on result. To tackle this problem, we propose a two-staged
pipeline: For the first stage, we leverage large-scale unpaired images, i.e.,
portraits with any wearable items, to train the model for mask-free
localization. Specifically, we repurpose the inpainting model to automatically
draw objects in suitable positions given an empty mask. For the second stage,
the model is further fine-tuned with paired images to transfer the consistency
of object appearance. We observed that the model after the first stage shows
quick convergence even with few paired samples. OmniTry is evaluated on a
comprehensive benchmark consisting of 12 common classes of wearable objects,
with both in-shop and in-the-wild images. Experimental results suggest that
OmniTry shows better performance on both object localization and
ID-preservation compared with existing methods. The code, model weights, and
evaluation benchmark of OmniTry will be made publicly available at
https://omnitry.github.io/.
☆ DiffIER: Optimizing Diffusion Models with Iterative Error Reduction
Diffusion models have demonstrated remarkable capabilities in generating
high-quality samples and enhancing performance across diverse domains through
Classifier-Free Guidance (CFG). However, the quality of generated samples is
highly sensitive to the selection of the guidance weight. In this work, we
identify a critical ``training-inference gap'' and argue that this gap
undermines the performance of conditional generation
and renders outputs highly sensitive to the guidance weight. We quantify this
gap by measuring the accumulated error during the inference stage and establish
a correlation between the selection of guidance weight and minimizing this gap.
Furthermore, to mitigate this gap, we propose DiffIER, an optimization-based
method for high-quality generation. We demonstrate that the accumulated error
can be effectively reduced by an iterative error minimization at each step
during inference. By introducing this novel plug-and-play optimization
framework, we enable the optimization of errors at every single inference step
and enhance generation quality. Empirical results demonstrate that our proposed
method outperforms baseline approaches in conditional generation tasks.
Furthermore, the method achieves consistent success in text-to-image
generation, image super-resolution, and text-to-speech generation, underscoring
its versatility and potential for broad applications in future research.
☆ State of Abdominal CT Datasets: A Critical Review of Bias, Clinical Relevance, and Real-world Applicability
Saeide Danaei, Zahra Dehghanian, Elahe Meftah, Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Faeze Khorasanizade, Hamid R. Rabiee
This systematic review critically evaluates publicly available abdominal CT
datasets and their suitability for artificial intelligence (AI) applications in
clinical settings. We examined 46 publicly available abdominal CT datasets
(50,256 studies). Across all 46 datasets, we found substantial redundancy
(59.1\% case reuse) and a Western/geographic skew (75.3\% from North America
and Europe). A bias assessment was performed on the 19 datasets with >=100
cases; within this subset, the most prevalent high-risk categories were domain
shift (63\%) and selection bias (57\%), both of which may undermine model
generalizability across diverse healthcare environments -- particularly in
resource-limited settings. To address these challenges, we propose targeted
strategies for dataset improvement, including multi-institutional
collaboration, adoption of standardized protocols, and deliberate inclusion of
diverse patient populations and imaging technologies. These efforts are crucial
in supporting the development of more equitable and clinically robust AI models
for abdominal imaging.
comment: Preprint. Submitted to IEEE Journal of Biomedical and Health
Informatics (under review). 10 pages, 3 figures, 5 tables
☆ RCGNet: RGB-based Category-Level 6D Object Pose Estimation with Geometric Guidance IROS2025
While most current RGB-D-based category-level object pose estimation methods
achieve strong performance, they face significant challenges in scenes lacking
depth information. In this paper, we propose a novel category-level object pose
estimation approach that relies solely on RGB images. This method enables
accurate pose estimation in real-world scenarios without the need for depth
data. Specifically, we design a transformer-based neural network for
category-level object pose estimation, where the transformer is employed to
predict and fuse the geometric features of the target object. To ensure that
these predicted geometric features faithfully capture the object's geometry, we
introduce a geometric feature-guided algorithm, which enhances the network's
ability to effectively represent the object's geometric information. Finally,
we utilize the RANSAC-PnP algorithm to compute the object's pose, addressing
the challenges associated with variable object scales in pose estimation.
Experimental results on benchmark datasets demonstrate that our approach is not
only highly efficient but also achieves superior accuracy compared to previous
RGB-based methods. These promising results offer a new perspective for
advancing category-level object pose estimation using RGB images.
comment: Accepted by IROS2025
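The final pose-recovery step can be illustrated with OpenCV's RANSAC-PnP, given predicted object-frame 3D points and their 2D image locations. The correspondences, noise, and intrinsics below are synthetic; only the solver call reflects the step named in the abstract.
```python
# Minimal example of the RANSAC-PnP pose-recovery step with OpenCV.
import cv2
import numpy as np

rng = np.random.default_rng(0)
K = np.array([[600.0, 0, 320], [0, 600.0, 240], [0, 0, 1]])
obj_pts = rng.uniform(-0.1, 0.1, size=(50, 3))                  # predicted 3D geometry

# Synthesize a ground-truth pose and project the points to get 2D observations.
rvec_gt = np.array([0.1, -0.2, 0.05])
tvec_gt = np.array([0.02, -0.01, 0.6])
img_pts, _ = cv2.projectPoints(obj_pts, rvec_gt, tvec_gt, K, None)
img_pts = img_pts.reshape(-1, 2) + rng.normal(0, 0.5, (50, 2))  # add pixel noise

ok, rvec, tvec, inliers = cv2.solvePnPRansac(
    obj_pts.astype(np.float32), img_pts.astype(np.float32), K, None,
    reprojectionError=3.0, iterationsCount=200)
print(ok, tvec.ravel())   # recovered translation should be close to tvec_gt
```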
☆ TalkVid: A Large-Scale Diversified Dataset for Audio-Driven Talking Head Synthesis
Shunian Chen, Hejin Huang, Yexin Liu, Zihan Ye, Pengcheng Chen, Chenghao Zhu, Michael Guan, Rongsheng Wang, Junying Chen, Guanbin Li, Ser-Nam Lim, Harry Yang, Benyou Wang
Audio-driven talking head synthesis has achieved remarkable photorealism, yet
state-of-the-art (SOTA) models exhibit a critical failure: they lack
generalization to the full spectrum of human diversity in ethnicity, language,
and age groups. We argue that this generalization gap is a direct symptom of
limitations in existing training data, which lack the necessary scale, quality,
and diversity. To address this challenge, we introduce TalkVid, a new
large-scale, high-quality, and diverse dataset containing 1244 hours of video
from 7729 unique speakers. TalkVid is curated through a principled, multi-stage
automated pipeline that rigorously filters for motion stability, aesthetic
quality, and facial detail, and is validated against human judgments to ensure
its reliability. Furthermore, we construct and release TalkVid-Bench, a
stratified evaluation set of 500 clips meticulously balanced across key
demographic and linguistic axes. Our experiments demonstrate that a model
trained on TalkVid outperforms counterparts trained on previous datasets,
exhibiting superior cross-dataset generalization. Crucially, our analysis on
TalkVid-Bench reveals performance disparities across subgroups that are
obscured by traditional aggregate metrics, underscoring its necessity for
future research. Code and data can be found in
https://github.com/FreedomIntelligence/TalkVid
☆ Two-Factor Authentication Smart Entryway Using Modified LBPH Algorithm
Zakiah Ayop, Wan Mohamad Hariz Bin Wan Mohamad Rosdi, Looi Wei Hua, Syarulnaziah Anawar, Nur Fadzilah Othman
Face mask detection has become increasingly important recently, particularly
during the COVID-19 pandemic. Many face detection models have been deployed in
IoT-based smart entryways; however, IoT development for face mask detection
remains limited. This paper proposes a two-factor authentication system for
smart entryway access control using facial recognition and passcode
verification and an automation process to alert the owner and activate the
surveillance system when a stranger is detected and controls the system
remotely via Telegram on a Raspberry Pi platform. The system employs the Local
Binary Patterns Histograms (LBPH) algorithm for full-face recognition and a
modified LBPH algorithm for occluded face detection. On average, the system achieved an
Accuracy of approximately 70%, a Precision of approximately 80%, and a Recall
of approximately 83.26% across all tested users. The results indicate that the
system is capable of conducting face recognition and mask detection, automating
the operation of the remote control to register users, locking or unlocking the
door, and notifying the owner. In the user acceptance test, the sample
participants expressed high acceptance of the system for future use.
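The face-recognition factor can be sketched with OpenCV's stock LBPH recognizer (available in opencv-contrib-python); the passcode factor, Telegram automation, and the paper's modified LBPH for occluded faces are not reproduced, and the enrolled faces and distance threshold below are placeholders.
```python
# Sketch of LBPH-based face verification (requires opencv-contrib-python).
import cv2
import numpy as np

recognizer = cv2.face.LBPHFaceRecognizer_create()

# Enroll users: grayscale face crops with integer labels (toy random data here).
faces = [np.random.randint(0, 255, (100, 100), dtype=np.uint8) for _ in range(10)]
labels = np.array([i % 2 for i in range(10)], dtype=np.int32)   # two enrolled users
recognizer.train(faces, labels)

def authenticate(face_gray, max_distance=70.0):
    # predict() returns (label, distance); lower distance means a better match.
    label, distance = recognizer.predict(face_gray)
    if distance <= max_distance:
        return label      # recognized: proceed to the passcode factor
    return None           # unknown face: trigger the owner alert instead

print(authenticate(faces[0]))
```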
☆ PersonaVlog: Personalized Multimodal Vlog Generation with Multi-Agent Collaboration and Iterative Self-Correction
With the growing demand for short videos and personalized content, automated
Video Log (Vlog) generation has become a key direction in multimodal content
creation. Existing methods mostly rely on predefined scripts, lacking dynamism
and personal expression. Therefore, there is an urgent need for an automated
Vlog generation approach that enables effective multimodal collaboration and
high personalization. To this end, we propose PersonaVlog, an automated
multimodal stylized Vlog generation framework that can produce personalized
Vlogs featuring videos, background music, and inner monologue speech based on a
given theme and reference image. Specifically, we propose a multi-agent
collaboration framework based on Multimodal Large Language Models (MLLMs). This
framework efficiently generates high-quality prompts for multimodal content
creation based on user input, thereby improving the efficiency and creativity
of the process. In addition, we incorporate a feedback and rollback mechanism
that leverages MLLMs to evaluate and provide feedback on generated results,
thereby enabling iterative self-correction of multimodal content. We also
propose ThemeVlogEval, a theme-based automated benchmarking framework that
provides standardized metrics and datasets for fair evaluation. Comprehensive
experiments demonstrate the significant advantages and potential of our
framework over several baselines, highlighting its effectiveness and great
potential for generating automated Vlogs.
comment: Project Page: https://personavlog-paper.github.io/
☆ Unleashing Semantic and Geometric Priors for 3D Scene Completion
Camera-based 3D semantic scene completion (SSC) provides dense geometric and
semantic perception for autonomous driving and robotic navigation. However,
existing methods rely on a coupled encoder to deliver both semantic and
geometric priors, which forces the model to make a trade-off between
conflicting demands and limits its overall performance. To tackle these
challenges, we propose FoundationSSC, a novel framework that performs dual
decoupling at both the source and pathway levels. At the source level, we
introduce a foundation encoder that provides rich semantic feature priors for
the semantic branch and high-fidelity stereo cost volumes for the geometric
branch. At the pathway level, these priors are refined through specialised,
decoupled pathways, yielding superior semantic context and depth distributions.
Our dual-decoupling design produces disentangled and refined inputs, which are
then utilised by a hybrid view transformation to generate complementary 3D
features. Additionally, we introduce a novel Axis-Aware Fusion (AAF) module
that addresses the often-overlooked challenge of fusing these features by
anisotropically merging them into a unified representation. Extensive
experiments demonstrate the advantages of FoundationSSC, achieving simultaneous
improvements in both semantic and geometric metrics, surpassing prior bests by
+0.23 mIoU and +2.03 IoU on SemanticKITTI. Additionally, we achieve
state-of-the-art performance on SSCBench-KITTI-360, with 21.78 mIoU and 48.61
IoU. The code will be released upon acceptance.
comment: 9 pages, 5 figures, 6 tables
☆ Towards Efficient Vision State Space Models via Token Merging
State Space Models (SSMs) have emerged as powerful architectures in computer
vision, yet improving their computational efficiency remains crucial for
practical and scalable deployment. While token reduction serves as an effective
approach for model efficiency, applying it to SSMs requires careful
consideration of their unique sequential modeling capabilities. In this work, we
propose MaMe, a token-merging strategy tailored for SSM-based vision
models. MaMe addresses two key challenges: quantifying token importance and
preserving sequential properties. Our approach leverages the state transition
parameter $\mathbf{\Delta}$ as an informativeness measure and introduces
strategic token arrangements to preserve sequential information flow. Extensive
experiments demonstrate that MaMe achieves superior efficiency-performance
trade-offs for both fine-tuned and off-the-shelf models. Particularly, our
approach maintains robustness even under aggressive token reduction where
existing methods undergo significant performance degradation. Beyond image
classification, MaMe shows strong generalization capabilities across video and
audio domains, establishing an effective approach for enhancing efficiency in
diverse SSM applications.
comment: under review
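An order-preserving token-merging step can be sketched as below: the r least important tokens are averaged into their left neighbours while the surviving tokens keep their original order. In MaMe the importance would come from the state transition parameter Delta; here it is an arbitrary score vector, and the merge rule is a simplified assumption.
```python
# Order-preserving token merging driven by an importance score per token.
import torch

def merge_tokens(tokens, scores, r):
    # tokens: (N, D) token embeddings, scores: (N,) importance, r: tokens to merge
    N, _ = tokens.shape
    drop = set(torch.topk(scores, r, largest=False).indices.tolist())
    drop.discard(0)                                   # keep the first token as an anchor
    out, counts = [], []
    for i in range(N):
        if i in drop and out:
            # running mean of the dropped token into the previous surviving token
            counts[-1] += 1
            out[-1] = out[-1] + (tokens[i] - out[-1]) / counts[-1]
        else:
            out.append(tokens[i].clone())
            counts.append(1)
    return torch.stack(out)                           # shorter sequence, order preserved

x = torch.randn(16, 64)
importance = torch.rand(16)                           # stand-in for Delta-based scores
print(merge_tokens(x, importance, r=6).shape)         # (16 - merged, 64)
```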
☆ Bridging Clear and Adverse Driving Conditions
Autonomous Driving (AD) systems exhibit markedly degraded performance under
adverse environmental conditions, such as low illumination and precipitation.
The underrepresentation of adverse conditions in AD datasets makes it
challenging to address this deficiency. To circumvent the prohibitive cost of
acquiring and annotating adverse weather data, we propose a novel Domain
Adaptation (DA) pipeline that transforms clear-weather images into fog, rain,
snow, and nighttime images. Here, we systematically develop and evaluate
several novel data-generation pipelines, including simulation-only, GAN-based,
and hybrid diffusion-GAN approaches, to synthesize photorealistic adverse
images from labelled clear images. We leverage an existing DA GAN, extend it to
support auxiliary inputs, and develop a novel training recipe that leverages
both simulated and real images. The simulated images facilitate exact
supervision by providing perfectly matched image pairs, while the real images
help bridge the simulation-to-real (sim2real) gap. We further introduce a
method to mitigate hallucinations and artifacts in Stable-Diffusion
Image-to-Image (img2img) outputs by blending them adaptively with their
progenitor images. We finetune downstream models on our synthetic data and
evaluate them on the Adverse Conditions Dataset with Correspondences (ACDC). We
achieve a 1.85 percent overall improvement in semantic segmentation and a 4.62
percent improvement at nighttime, demonstrating the efficacy of our hybrid method for
robust AD perception under challenging conditions.
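The adaptive blending guard can be sketched as a per-pixel weighting that falls back to the progenitor image wherever the img2img output drifts far from it. The Gaussian weighting and its bandwidth are illustrative assumptions rather than the paper's exact rule.
```python
# Sketch of adaptively blending an img2img output with its progenitor image.
import numpy as np

def adaptive_blend(original, generated, sigma=0.15):
    # original, generated: float arrays in [0, 1], shape (H, W, 3)
    diff = np.abs(generated - original).mean(axis=-1, keepdims=True)  # per-pixel drift
    w = np.exp(-(diff / sigma) ** 2)        # w -> 1 where faithful, -> 0 where it drifts
    return w * generated + (1.0 - w) * original

orig = np.random.rand(256, 256, 3).astype(np.float32)
gen = np.clip(orig + np.random.normal(0, 0.2, orig.shape), 0, 1).astype(np.float32)
blended = adaptive_blend(orig, gen)
print(blended.shape, blended.min() >= 0.0, blended.max() <= 1.0)
```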
☆ Breaking the SFT Plateau: Multimodal Structured Reinforcement Learning for Chart-to-Code Generation
While reinforcement learning (RL) has proven highly effective for general
reasoning in vision-language models, its application to tasks requiring
in-depth understanding of information-rich images and generation of structured
outputs remains underexplored. Chart-to-code generation exemplifies this
challenge, demanding complex reasoning over visual charts to generate
structured code. Supervised fine-tuning (SFT) alone is often insufficient,
highlighting the need for effective RL strategies that appropriately reward
structured outputs. We systematically investigate the performance plateau in
SFT through large-scale experiments and propose Multimodal Structured
Reinforcement Learning (MSRL) for chart-to-code generation, which substantially
breaks through this plateau. We construct the largest training corpus to date,
containing 3 million chart-code pairs from real-world arXiv tables to mitigate
simplistic patterns of prior synthetic data. Despite reaching state-of-the-art
performance, our experiments show that scaling SFT data eventually hits a
plateau where further increases yield negligible improvements. Our MSRL method
leverages a multi-granularity structured reward system using multimodal textual
and visual feedback. At the textual level, rule-based rewards validate
fine-grained code details. At the visual level, model-based rewards assess
structural similarity by rendering generated code into images and employing an
evaluator model. We implement this within a two-stage curriculum for training
stability. Results demonstrate that MSRL significantly breaks the SFT plateau,
improving high-level metrics by 6.2% and 9.9% on ChartMimic and ReachQA
benchmarks respectively, achieving competitive performance with advanced
closed-source models.
comment: technical report
☆ Temporal-Conditional Referring Video Object Segmentation with Noise-Free Text-to-Video Diffusion Model
Referring Video Object Segmentation (RVOS) aims to segment specific objects
in a video according to textual descriptions. We observe that recent RVOS
approaches often place excessive emphasis on feature extraction and temporal
modeling, while relatively neglecting the design of the segmentation head. In
fact, there remains considerable room for improvement in segmentation head
design. To address this, we propose a Temporal-Conditional Referring Video
Object Segmentation model, which innovatively integrates existing segmentation
methods to effectively enhance boundary segmentation capability. Furthermore,
our model leverages a text-to-video diffusion model for feature extraction. On
top of this, we remove the traditional noise prediction module to avoid the
randomness of noise from degrading segmentation accuracy, thereby simplifying
the model while improving performance. Finally, to overcome the limited feature
extraction capability of the VAE, we design a Temporal Context Mask Refinement
(TCMR) module, which significantly improves segmentation quality without
introducing complex designs. We evaluate our method on four public RVOS
benchmarks, where it consistently achieves state-of-the-art performance.
comment: 11 pages, 7 figures
☆ Generative Model-Based Feature Attention Module for Video Action Analysis
Video action analysis is a foundational technology for intelligent video
comprehension, particularly in Internet of Things (IoT) applications. However,
existing methodologies overlook feature semantics during feature extraction and
focus on optimizing action proposals, which limits their precision and makes
them unsuitable for widespread adoption in high-performance IoT applications
such as autonomous driving that demand robust and scalable intelligent video
analytics. To
address this issue, we propose a novel generative attention-based model to
learn the relation of feature semantics. Specifically, by leveraging the
differences of actions' foreground and background, our model simultaneously
learns the frame- and segment-dependencies of temporal action feature
semantics, which takes advantage of feature semantics in the feature extraction
effectively. To evaluate the effectiveness of our model, we conduct extensive
experiments on two benchmark video tasks, action recognition and action
detection. In the context of action detection tasks, we substantiate the
superiority of our approach through comprehensive validation on widely
recognized datasets. Moreover, we extend the validation of the effectiveness of
our proposed method to a broader task, video action recognition. Our code is
available at https://github.com/Generative-Feature-Model/GAF.
☆ The 9th AI City Challenge ICCV 2025
Zheng Tang, Shuo Wang, David C. Anastasiu, Ming-Ching Chang, Anuj Sharma, Quan Kong, Norimasa Kobori, Munkhjargal Gochoo, Ganzorig Batnasan, Munkh-Erdene Otgonbold, Fady Alnajjar, Jun-Wei Hsieh, Tomasz Kornuta, Xiaolong Li, Yilin Zhao, Han Zhang, Subhashree Radhakrishnan, Arihant Jain, Ratnesh Kumar, Vidya N. Murali, Yuxing Wang, Sameer Satish Pusegaonkar, Yizhou Wang, Sujit Biswas, Xunlei Wu, Zhedong Zheng, Pranamesh Chakraborty, Rama Chellappa
The ninth AI City Challenge continues to advance real-world applications of
computer vision and AI in transportation, industrial automation, and public
safety. The 2025 edition featured four tracks and saw a 17% increase in
participation, with 245 teams from 15 countries registered on the evaluation
server. Public release of challenge datasets led to over 30,000 downloads to
date. Track 1 focused on multi-class 3D multi-camera tracking, involving
people, humanoids, autonomous mobile robots, and forklifts, using detailed
calibration and 3D bounding box annotations. Track 2 tackled video question
answering in traffic safety, with multi-camera incident understanding enriched
by 3D gaze labels. Track 3 addressed fine-grained spatial reasoning in dynamic
warehouse environments, requiring AI systems to interpret RGB-D inputs and
answer spatial questions that combine perception, geometry, and language. Both
Track 1 and Track 3 datasets were generated in NVIDIA Omniverse. Track 4
emphasized efficient road object detection from fisheye cameras, supporting
lightweight, real-time deployment on edge devices. The evaluation framework
enforced submission limits and used a partially held-out test set to ensure
fair benchmarking. Final rankings were revealed after the competition
concluded, fostering reproducibility and mitigating overfitting. Several teams
achieved top-tier results, setting new benchmarks in multiple tasks.
comment: Summary of the 9th AI City Challenge Workshop in conjunction with
ICCV 2025
☆ Learnable SMPLify: A Neural Solution for Optimization-Free Human Pose Inverse Kinematics
In 3D human pose and shape estimation, SMPLify remains a robust baseline that
solves inverse kinematics (IK) through iterative optimization. However, its
high computational cost limits its practicality. Recent advances across domains
have shown that replacing iterative optimization with data-driven neural
networks can achieve significant runtime improvements without sacrificing
accuracy. Motivated by this trend, we propose Learnable SMPLify, a neural
framework that replaces the iterative fitting process in SMPLify with a
single-pass regression model. The design of our framework targets two core
challenges in neural IK: data construction and generalization. To enable
effective training, we propose a temporal sampling strategy that constructs
initialization-target pairs from sequential frames. To improve generalization
across diverse motions and unseen poses, we propose a human-centric
normalization scheme and residual learning to narrow the solution space.
Learnable SMPLify supports both sequential inference and plug-in
post-processing to refine existing image-based estimators. Extensive
experiments demonstrate that our method establishes itself as a practical and
simple baseline: it achieves nearly 200x faster runtime compared to SMPLify,
generalizes well to unseen 3DPW and RICH, and operates in a model-agnostic
manner when used as a plug-in tool on LucidAction. The code is available at
https://github.com/Charrrrrlie/Learnable-SMPLify.
☆ DictAS: A Framework for Class-Generalizable Few-Shot Anomaly Segmentation via Dictionary Lookup ICCV 2025
Zhen Qu, Xian Tao, Xinyi Gong, ShiChen Qu, Xiaopei Zhang, Xingang Wang, Fei Shen, Zhengtao Zhang, Mukesh Prasad, Guiguang Ding
Recent vision-language models (e.g., CLIP) have demonstrated remarkable
class-generalizable ability to unseen classes in few-shot anomaly segmentation
(FSAS), leveraging supervised prompt learning or fine-tuning on seen classes.
However, their cross-category generalization largely depends on prior knowledge
of real seen anomaly samples. In this paper, we propose a novel framework,
namely DictAS, which enables a unified model to detect visual anomalies in
unseen object categories without any retraining on the target data, only
employing a few normal reference images as visual prompts. The insight behind
DictAS is to transfer dictionary lookup capabilities to the FSAS task for
unseen classes via self-supervised learning, instead of merely memorizing the
normal and abnormal feature patterns from the training set. Specifically,
DictAS mainly consists of three components: (1) Dictionary Construction - to
simulate the index and content of a real dictionary using features from normal
reference images. (2) Dictionary Lookup - to retrieve queried region features
from the dictionary via a sparse lookup strategy. When a query feature cannot
be retrieved, it is classified as an anomaly. (3) Query Discrimination
Regularization - to enhance anomaly discrimination by making abnormal features
harder to retrieve from the dictionary. To achieve this, Contrastive Query
Constraint and Text Alignment Constraint are further proposed. Extensive
experiments on seven public industrial and medical datasets demonstrate that
DictAS consistently outperforms state-of-the-art FSAS methods.
comment: Accepted by ICCV 2025, Project: https://github.com/xiaozhen228/DictAS
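The dictionary-lookup idea can be sketched as below: query patch features are reconstructed from their top-k nearest dictionary entries (built from normal reference images), and patches that reconstruct poorly are scored as anomalous. The value of k, the softmax weighting, and the residual score are assumptions.
```python
# Sketch of dictionary-lookup anomaly scoring over patch features.
import torch
import torch.nn.functional as F

def lookup_anomaly_scores(query_feats, ref_feats, k=5):
    # query_feats: (Nq, D) patch features of the test image
    # ref_feats:   (Nr, D) dictionary built from normal reference images
    q = F.normalize(query_feats, dim=-1)
    r = F.normalize(ref_feats, dim=-1)
    sim = q @ r.T                                            # (Nq, Nr) cosine similarity
    topv, topi = sim.topk(k, dim=-1)
    w = torch.softmax(topv, dim=-1)                          # sparse lookup weights
    recon = torch.einsum("nk,nkd->nd", w, ref_feats[topi])   # weighted reconstruction
    residual = 1.0 - F.cosine_similarity(query_feats, recon, dim=-1)
    return residual                                          # high residual = anomalous

ref = torch.randn(2048, 256)                                 # features from normal shots
qry = torch.cat([ref[:100] + 0.01 * torch.randn(100, 256),   # near-normal patches
                 torch.randn(20, 256)])                      # out-of-dictionary patches
scores = lookup_anomaly_scores(qry, ref)
print(scores[:100].mean().item(), scores[100:].mean().item())  # anomalies score higher
```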
☆ Color Spike Data Generation via Bio-inspired Neuron-like Encoding with an Artificial Photoreceptor Layer
In recent years, neuromorphic computing and spiking neural networks (SNNs)
have advanced rapidly through integration with deep learning. However, the
performance of SNNs still lags behind that of convolutional neural networks
(CNNs), primarily due to the limited information capacity of spike-based data.
Although some studies have attempted to improve SNN performance by training
them with non-spiking inputs such as static images, this approach deviates from
the original intent of neuromorphic computing, which emphasizes spike-based
information processing. To address this issue, we propose a Neuron-like
Encoding method that generates spike data based on the intrinsic operational
principles and functions of biological neurons. This method is further enhanced
by the incorporation of an artificial photoreceptor layer, enabling spike data
to carry both color and luminance information, thereby forming a complete
visual spike signal. Experimental results using the Integrate-and-Fire neuron
model demonstrate that this biologically inspired approach effectively
increases the information content of spike signals and improves SNN
performance, all while adhering to neuromorphic principles. We believe this
concept holds strong potential for future development and may contribute to
overcoming current limitations in neuromorphic computing, facilitating broader
applications of SNNs.
comment: 14 pages, 11 figures
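A minimal integrate-and-fire rate encoder conveys the basic idea of neuron-like spike encoding: the membrane potential integrates the input each timestep and emits a spike when it crosses a threshold, resetting by subtraction. The threshold and number of timesteps are assumptions, and the paper's artificial photoreceptor layer is not modeled here.
```python
# Minimal integrate-and-fire encoder turning pixel intensities into spike trains.
import numpy as np

def if_encode(image, timesteps=16, threshold=1.0):
    # image: float array in [0, 1], any shape (e.g. H x W x 3 for RGB input)
    v = np.zeros_like(image)                       # membrane potential
    spikes = np.zeros((timesteps,) + image.shape, dtype=np.uint8)
    for t in range(timesteps):
        v += image                                 # integrate the input current
        fired = v >= threshold
        spikes[t] = fired
        v = np.where(fired, v - threshold, v)      # reset by subtraction
    return spikes                                  # (T, ...) binary spike tensor

img = np.random.rand(8, 8, 3)
s = if_encode(img)
# The spike rate approximates the input intensity.
print(np.allclose(s.mean(axis=0), np.floor(img * 16) / 16, atol=1/16))
```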
☆ A Lightweight Dual-Mode Optimization for Generative Face Video Coding
Generative Face Video Coding (GFVC) achieves superior rate-distortion
performance by leveraging the strong inference capabilities of deep generative
models. However, its practical deployment is hindered by large model parameters
and high computational costs. To address this, we propose a lightweight GFVC
framework that introduces dual-mode optimization -- combining architectural
redesign and operational refinement -- to reduce complexity whilst preserving
reconstruction quality. Architecturally, we replace traditional 3 x 3
convolutions with slimmer and more efficient layers, reducing complexity
without compromising feature expressiveness. Operationally, we develop a
two-stage adaptive channel pruning strategy: (1) soft pruning during training
identifies redundant channels via learnable thresholds, and (2) hard pruning
permanently eliminates these channels post-training using a derived mask. This
dual-phase approach ensures both training stability and inference efficiency.
Experimental results demonstrate that the proposed lightweight dual-mode
optimization for GFVC can achieve 90.4% parameter reduction and 88.9%
computation saving compared to the baseline, whilst achieving superior
performance compared to state-of-the-art video coding standard Versatile Video
Coding (VVC) in terms of perceptual-level quality metrics. As such, the
proposed method is expected to enable efficient GFVC deployment in
resource-constrained environments such as mobile edge devices.
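The two-stage pruning idea can be sketched as a convolution whose output channels are gated softly by learnable thresholds during training and masked hard after training. The gate parameterisation, temperature, and the 0.5 cut-off are illustrative assumptions; a sparsity penalty on the gates (omitted here) would normally drive channels toward removal.
```python
# Sketch of two-stage adaptive channel pruning: soft gates, then a hard mask.
import torch
import torch.nn as nn

class PrunableConv(nn.Module):
    def __init__(self, cin, cout, temperature=10.0):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
        self.gate_logit = nn.Parameter(torch.zeros(cout))   # learnable soft gates
        self.temperature = temperature
        self.register_buffer("hard_mask", torch.ones(cout))

    def forward(self, x, soft=True):
        y = self.conv(x)
        if soft:   # stage 1: differentiable soft pruning during training
            gate = torch.sigmoid(self.temperature * self.gate_logit)
        else:      # stage 2: hard pruning at inference using the derived mask
            gate = self.hard_mask
        return y * gate.view(1, -1, 1, 1)

    def derive_hard_mask(self):
        with torch.no_grad():
            soft_gate = torch.sigmoid(self.temperature * self.gate_logit)
            self.hard_mask.copy_((soft_gate > 0.5).float())

layer = PrunableConv(3, 32)
layer.derive_hard_mask()
x = torch.randn(1, 3, 64, 64)
print(layer(x, soft=False).shape)   # torch.Size([1, 32, 64, 64]), pruned channels zeroed
```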
☆ GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering
Foveated rendering significantly reduces computational demands in virtual
reality applications by concentrating rendering quality where users focus their
gaze. Current approaches require expensive hardware-based eye tracking systems,
limiting widespread adoption due to cost, calibration complexity, and hardware
compatibility constraints. This paper presents GazeProphet, a software-only
approach for predicting gaze locations in VR environments without requiring
dedicated eye tracking hardware. The approach combines a Spherical Vision
Transformer for processing 360-degree VR scenes with an LSTM-based temporal
encoder that captures gaze sequence patterns. A multi-modal fusion network
integrates spatial scene features with temporal gaze dynamics to predict future
gaze locations with associated confidence estimates. Experimental evaluation on
a comprehensive VR dataset demonstrates that GazeProphet achieves a median
angular error of 3.83 degrees, outperforming traditional saliency-based
baselines by 24% while providing reliable confidence calibration. The approach
maintains consistent performance across different spatial regions and scene
types, enabling practical deployment in VR systems without additional hardware
requirements. Statistical analysis confirms the significance of improvements
across all evaluation metrics. These results show that software-only gaze
prediction can work for VR foveated rendering, making this performance boost
more accessible to different VR platforms and apps.
comment: 8 pages, 3 figures
☆ FLAIR: Frequency- and Locality-Aware Implicit Neural Representations
Implicit Neural Representations (INRs) leverage neural networks to map
coordinates to corresponding signals, enabling continuous and compact
representations. This paradigm has driven significant advances in various
vision tasks. However, existing INRs lack frequency selectivity, spatial
localization, and sparse representations, leading to an over-reliance on
redundant signal components. Consequently, they exhibit spectral bias, tending
to learn low-frequency components early while struggling to capture fine
high-frequency details. To address these issues, we propose FLAIR (Frequency-
and Locality-Aware Implicit Neural Representations), which incorporates two key
innovations. The first is RC-GAUSS, a novel activation designed for explicit
frequency selection and spatial localization under the constraints of the
time-frequency uncertainty principle (TFUP). The second is
Wavelet-Energy-Guided Encoding (WEGE), which leverages the discrete wavelet
transform (DWT) to compute energy scores and explicitly guide frequency
information to the network. Our method consistently outperforms existing INRs
in 2D image representation and restoration, as well as 3D reconstruction.
comment: Please visit our project page at https://cmlab-korea.github.io/FLAIR/
☆ EAvatar: Expression-Aware Head Avatar Reconstruction with Generative Geometry Priors
High-fidelity head avatar reconstruction plays a crucial role in AR/VR,
gaming, and multimedia content creation. Recent advances in 3D Gaussian
Splatting (3DGS) have demonstrated effectiveness in modeling complex geometry
with real-time rendering capability and are now widely used in high-fidelity
head avatar reconstruction tasks. However, existing 3DGS-based methods still
face significant challenges in capturing fine-grained facial expressions and
preserving local texture continuity, especially in highly deformable regions.
To mitigate these limitations, we propose a novel 3DGS-based framework termed
EAvatar for head reconstruction that is both expression-aware and
deformation-aware. Our method introduces a sparse expression control mechanism,
where a small number of key Gaussians are used to influence the deformation of
their neighboring Gaussians, enabling accurate modeling of local deformations
and fine-scale texture transitions. Furthermore, we leverage high-quality 3D
priors from pretrained generative models to provide a more reliable facial
geometry, offering structural guidance that improves convergence stability and
shape accuracy during training. Experimental results demonstrate that our
method produces more accurate and visually coherent head reconstructions with
improved expression controllability and detail fidelity.
comment: 20 pages, 11 figures
☆ MimicFunc: Imitating Tool Manipulation from a Single Human Video via Functional Correspondence
Imitating tool manipulation from human videos offers an intuitive approach to
teaching robots, while also providing a promising and scalable alternative to
labor-intensive teleoperation data collection for visuomotor policy learning.
While humans can mimic tool manipulation behavior by observing others perform a
task just once and effortlessly transfer the skill to diverse tools for
functionally equivalent tasks, current robots struggle to achieve this level of
generalization. A key challenge lies in establishing function-level
correspondences, considering the significant geometric variations among
functionally similar tools, referred to as intra-function variations. To
address this challenge, we propose MimicFunc, a framework that establishes
functional correspondences with function frame, a function-centric local
coordinate frame constructed with keypoint-based abstraction, for imitating
tool manipulation skills. Experiments demonstrate that MimicFunc effectively
enables the robot to generalize the skill from a single RGB-D human video to
manipulating novel tools for functionally equivalent tasks. Furthermore,
leveraging MimicFunc's one-shot generalization capability, the generated
rollouts can be used to train visuomotor policies without requiring
labor-intensive teleoperation data collection for novel objects. Our code and
video are available at https://sites.google.com/view/mimicfunc.
comment: Accepted to CoRL 2025
☆ Evaluating Open-Source Vision Language Models for Facial Emotion Recognition against Traditional Deep Learning Models
Facial Emotion Recognition (FER) is crucial for applications such as
human-computer interaction and mental health diagnostics. This study presents
the first empirical comparison of open-source Vision-Language Models (VLMs),
including Phi-3.5 Vision and CLIP, against traditional deep learning models
VGG19, ResNet-50, and EfficientNet-B0 on the challenging FER-2013 dataset,
which contains 35,887 low-resolution grayscale images across seven emotion
classes. To address the mismatch between VLM training assumptions and the noisy
nature of FER data, we introduce a novel pipeline that integrates GFPGAN-based
image restoration with FER evaluation. Results show that traditional models,
particularly EfficientNet-B0 (86.44%) and ResNet-50 (85.72%), significantly
outperform VLMs like CLIP (64.07%) and Phi-3.5 Vision (51.66%), highlighting
the limitations of VLMs in low-quality visual tasks. In addition to performance
evaluation using precision, recall, F1-score, and accuracy, we provide a
detailed computational cost analysis covering preprocessing, training,
inference, and evaluation phases, offering practical insights for deployment.
This work underscores the need for adapting VLMs to noisy environments and
provides a reproducible benchmark for future research in emotion recognition.
☆ Calibrating Biased Distribution in VFM-derived Latent Space via Cross-Domain Geometric Consistency CVPR
Despite the fast progress of deep learning, one standing challenge is the gap
between the observed training samples and the underlying true distribution.
There are multiple causes of this gap, e.g., sampling bias and noise.
In the era of foundation models, we show that when leveraging the off-the-shelf
(vision) foundation models (e.g., CLIP, DINOv2) for feature extraction, the
geometric shapes of the resulting feature distributions exhibit remarkable
transferability across domains and datasets. To verify its practical
usefulness, we embody our geometric knowledge-guided distribution calibration
framework in two popular and challenging settings: federated learning and
long-tailed recognition. In the federated setting, we devise a technique of
acquiring the global geometric shape under privacy constraints, then leverage
this knowledge to generate new samples for clients, with the aim of bridging
the gap between local and global observations. In long-tailed learning, the
framework utilizes
the geometric knowledge transferred from sample-rich categories to recover the
true distribution for sample-scarce tail classes. Comprehensive experiments
show that our proposed geometric knowledge-guided distribution calibration
effectively overcomes information deficits caused by data heterogeneity and
sample imbalance, with boosted performance across benchmarks.
comment: 15 pages, CVPR Oral
☆ 2D Gaussians Meet Visual Tokenizer
The image tokenizer is a critical component in AR image generation, as it
determines how rich and structured visual content is encoded into compact
representations. Existing quantization-based tokenizers such as VQ-GAN
primarily focus on appearance features like texture and color, often neglecting
geometric structures due to their patch-based design. In this work, we explore
how to incorporate more visual information into the tokenizer and propose a
new framework named Visual Gaussian Quantization (VGQ), a novel tokenizer
paradigm that explicitly enhances structural modeling by integrating 2D
Gaussians into traditional visual codebook quantization frameworks. Our
approach addresses the inherent limitations of naive quantization methods such
as VQ-GAN, which struggle to model structured visual information due to their
patch-based design and emphasis on texture and color. In contrast, VGQ encodes
image latents as 2D Gaussian distributions, effectively capturing geometric and
spatial structures by directly modeling structure-related parameters such as
position, rotation and scale. We further demonstrate that increasing the
density of 2D Gaussians within the tokens leads to significant gains in
reconstruction fidelity, providing a flexible trade-off between token
efficiency and visual richness. On the ImageNet 256x256 benchmark, VGQ achieves
strong reconstruction quality with an rFID score of 1.00. Furthermore, by
increasing the density of 2D Gaussians within the tokens, VGQ gains a
significant boost in reconstruction capability and achieves a state-of-the-art
reconstruction rFID score of 0.556 and a PSNR of 24.93, substantially
outperforming existing methods. Codes will be released soon.
☆ Bridging the Gap: Doubles Badminton Analysis with Singles-Trained Models
Badminton is known as one of the fastest racket sports in the world. Despite
doubles matches being more prevalent in international tournaments than singles,
previous research has mainly focused on singles due to the challenges in data
availability and multi-person tracking. To address this gap, we designed an
approach that transfers singles-trained models to doubles analysis. We
extracted keypoints from the ShuttleSet single matches dataset using ViT-Pose
and embedded them through a contrastive learning framework based on ST-GCN. To
improve tracking stability, we incorporated a custom multi-object tracking
algorithm that resolves ID switching issues from fast and overlapping player
movements. A Transformer-based classifier then determines shot occurrences
based on the learned embeddings. Our findings demonstrate the feasibility of
extending pose-based shot recognition to doubles badminton, broadening
analytics capabilities. This work establishes a foundation for doubles-specific
datasets to enhance understanding of this predominant yet understudied format
of this fast racket sport.
comment: 14 pages, 7 figures
☆ AdaptiveAE: An Adaptive Exposure Strategy for HDR Capturing in Dynamic Scenes ICCV 2025
Mainstream high dynamic range imaging techniques typically rely on fusing
multiple images captured with different exposure setups (shutter speed and
ISO). A good balance between shutter speed and ISO is crucial for achieving
high-quality HDR, as high ISO values introduce significant noise, while long
shutter speeds can lead to noticeable motion blur. However, existing methods
often overlook the complex interaction between shutter speed and ISO and fail
to account for motion blur effects in dynamic scenes.
In this work, we propose AdaptiveAE, a reinforcement learning-based method
that optimizes the selection of shutter speed and ISO combinations to maximize
HDR reconstruction quality in dynamic environments. AdaptiveAE integrates an
image synthesis pipeline that incorporates motion blur and noise simulation
into our training procedure, leveraging semantic information and exposure
histograms. It can adaptively select optimal ISO and shutter speed sequences
based on a user-defined exposure time budget, and find a better exposure
schedule than traditional solutions. Experimental results across multiple
datasets demonstrate that it achieves state-of-the-art performance.
comment: Accepted to ICCV 2025
☆ Multi-view Clustering via Bi-level Decoupling and Consistency Learning
Multi-view clustering has been shown to be an effective method for analyzing
underlying patterns in multi-view data. The performance of clustering can be
improved by learning the consistency and complementarity between multi-view
features, however, cluster-oriented representation learning is often
overlooked. In this paper, we propose a novel Bi-level Decoupling and
Consistency Learning framework (BDCL) to further explore the effective
representation for multi-view data to enhance inter-cluster discriminability
and intra-cluster compactness of features in multi-view clustering. Our
framework comprises three modules: 1) The multi-view instance learning module
aligns consistent information while preserving view-private features through a
reconstruction autoencoder and contrastive learning. 2) The
bi-level decoupling of features and clusters enhances the discriminability of
feature space and cluster space. 3) The consistency learning module treats the
different views of the sample and their neighbors as positive pairs, learns the
consistency of their clustering assignments, and further compresses the
intra-cluster space. Experimental results on five benchmark datasets
demonstrate the superiority of the proposed method compared with the SOTA
methods. Our code is published on https://github.com/LouisDong95/BDCL.
☆ ROVER: Robust Loop Closure Verification with Trajectory Prior in Repetitive Environments
Loop closure detection is important for simultaneous localization and mapping
(SLAM), which associates current observations with historical keyframes,
achieving drift correction and global relocalization. However, a falsely
detected loop can be fatal, and this is especially difficult in repetitive
environments where appearance-based features fail due to the high similarity.
Therefore, verification of a loop closure is a critical step in avoiding false
positive detections. Existing works in loop closure verification predominantly
focus on learning invariant appearance features, neglecting the prior knowledge
of the robot's spatial-temporal motion cue, i.e., trajectory. In this letter,
we propose ROVER, a loop closure verification method that leverages the
historical trajectory as a prior constraint to reject false loops in
challenging repetitive environments. Each loop candidate is first used to
estimate the robot trajectory via pose-graph optimization. This trajectory
is then submitted to a scoring scheme that assesses its compliance with the
trajectory without the loop, which we refer to as the trajectory prior, to
determine if the loop candidate should be accepted. Benchmark comparisons and
real-world experiments demonstrate the effectiveness of the proposed method.
Furthermore, we integrate ROVER into state-of-the-art SLAM systems to verify
its robustness and efficiency. Our source code and self-collected dataset are
available at https://github.com/jarvisyjw/ROVER.
comment: 8 pages, 9 figures
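A minimal sketch of the trajectory-prior check described above, assuming the
compliance score is simply an RMSE between the loop-optimized trajectory and
the trajectory prior; the function names and the 0.5 m threshold are
illustrative, and ROVER's actual scoring scheme may differ:
```python
import numpy as np

def trajectory_rmse(traj_with_loop: np.ndarray, traj_prior: np.ndarray) -> float:
    """RMSE between corresponding keyframe translations of two (N, 3) trajectories."""
    diff = traj_with_loop - traj_prior
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

def accept_loop(traj_with_loop, traj_prior, threshold_m=0.5):
    """Accept the loop candidate only if the loop-optimized trajectory stays
    close to the trajectory prior (hypothetical threshold in meters)."""
    return trajectory_rmse(traj_with_loop, traj_prior) < threshold_m
```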
☆ CORENet: Cross-Modal 4D Radar Denoising Network with LiDAR Supervision for Autonomous Driving IROS 2025
4D radar-based object detection has garnered great attention for its
robustness in adverse weather conditions and capacity to deliver rich spatial
information across diverse driving scenarios. Nevertheless, the sparse and
noisy nature of 4D radar point clouds poses substantial challenges for
effective perception. To address this limitation, we present CORENet, a novel
cross-modal denoising framework that leverages LiDAR supervision to identify
noise patterns and extract discriminative features from raw 4D radar data.
Designed as a plug-and-play architecture, our solution enables seamless
integration into voxel-based detection frameworks without modifying existing
pipelines. Notably, the proposed method only utilizes LiDAR data for
cross-modal supervision during training while maintaining full radar-only
operation during inference. Extensive evaluation on the challenging Dual-Radar
dataset, which is characterized by elevated noise level, demonstrates the
effectiveness of our framework in enhancing detection robustness. Comprehensive
experiments validate that CORENet achieves superior performance compared to
existing mainstream approaches.
comment: 8 pages, 5 figures, Accepted to IROS 2025
☆ FAMNet: Integrating 2D and 3D Features for Micro-expression Recognition via Multi-task Learning and Hierarchical Attention IJCNN 2025
Micro-expression recognition (MER) has essential application value in many
fields, but the short duration and low intensity of micro-expressions (MEs)
bring considerable challenges to MER. The current MER methods in deep learning
mainly include three data loading methods: static images, dynamic image
sequence, and a combination of the two streams. Effectively extracting MEs'
fine-grained spatiotemporal features remains an open problem. This
paper proposes a new MER method based on multi-task learning and hierarchical
attention, which fully extracts MEs' omni-directional features by merging 2D
and 3D CNNs. The fusion model comprises a 2D CNN (AMNet2D) and a 3D CNN
(AMNet3D) with similar structures, each built on a shared ResNet18 backbone
with attention modules. During training, each network adopts the data loading
method suited to it; the model is jointly trained on MER and facial action
unit detection (FAUD) with hard parameter sharing for information association,
which further improves MER performance. The final fused model is called
FAMNet.
Extensive experimental results show that our proposed FAMNet significantly
improves task performance. On the SAMM, CASME II and MMEW datasets, FAMNet
achieves 83.75% (UAR) and 84.03% (UF1). Furthermore, on the challenging
CAS(ME)$^3$ dataset, FAMNet achieves 51% (UAR) and 43.42% (UF1).
comment: 8 pages, 6 figures. Accepted to IJCNN 2025
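A minimal PyTorch sketch of the hard parameter sharing idea behind this kind
of multi-task setup: one shared ResNet18 trunk feeding separate MER and FAUD
heads. Head sizes and losses are illustrative, not FAMNet's exact
configuration:
```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class SharedBackboneMTL(nn.Module):
    """Hard parameter sharing: one ResNet18 trunk with two task heads
    (micro-expression classes and facial action units)."""
    def __init__(self, num_emotions=3, num_aus=12):
        super().__init__()
        trunk = resnet18(weights=None)
        self.backbone = nn.Sequential(*list(trunk.children())[:-1])  # drop the fc layer
        self.mer_head = nn.Linear(512, num_emotions)   # micro-expression recognition
        self.faud_head = nn.Linear(512, num_aus)       # facial action unit detection

    def forward(self, x):
        feat = self.backbone(x).flatten(1)             # shared (B, 512) features
        return self.mer_head(feat), self.faud_head(feat)

model = SharedBackboneMTL()
mer_logits, au_logits = model(torch.randn(2, 3, 224, 224))
loss = nn.CrossEntropyLoss()(mer_logits, torch.tensor([0, 1])) \
     + nn.BCEWithLogitsLoss()(au_logits, torch.zeros(2, 12))
```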
☆ Towards Understanding and Harnessing the Transferability of Prognostic Knowledge in Computational Pathology
Whole-Slide Image (WSI) is an important tool for evaluating the prognosis of
cancer patients. Present WSI-based prognosis studies generally follow a
conventional paradigm -- cancer-specific model development -- where one cancer
disease corresponds to one model and this model cannot make use of the
prognostic knowledge from others. Despite its notable success in recent years,
this paradigm has inherent limitations and continues to struggle with
practical requirements: (i) scaling to the rare tumor diseases with very
limited samples and (ii) benefiting from the generalizable prognostic knowledge
in other cancers. To this end, this paper presents the first systematic study
on Prognostic Knowledge Transfer in Pathology, called Path-PKT. It comprises
three main parts. (1) We curate a large dataset (UNI2-h-DSS) with 13 cancers
and use it to evaluate the transferability of prognostic knowledge between
different cancers computationally. (2) We design experiments to understand what
factors affect knowledge transfer and what causes positive transfers. (3)
Motivated by empirical findings, we propose a new baseline approach (MoE-PKT)
with a routing mechanism to utilize the generalizable prognostic knowledge in
other cancers. Finally, we show the transferability of source models to rare
tumor diseases. This study could lay solid foundations for the study of
knowledge transfer in WSI-based cancer prognosis. Source code is available at
https://github.com/liupei101/Path-PKT.
comment: 15 pages (13 figures and 5 tables)
☆ Enhancing Robustness of Implicit Neural Representations Against Weight Perturbations
Implicit Neural Representations (INRs) encode discrete signals in a
continuous manner using neural networks, demonstrating significant value across
various multimedia applications. However, the vulnerability of INRs presents a
critical challenge for their real-world deployments, as the network weights
might be subjected to unavoidable perturbations. In this work, we investigate
the robustness of INRs for the first time and find that even minor
perturbations can lead to substantial performance degradation in the quality of
signal reconstruction. To mitigate this issue, we formulate the robustness
problem in INRs by minimizing the difference between loss with and without
weight perturbations. Furthermore, we derive a novel robust loss function to
regulate the gradient of the reconstruction loss with respect to weights,
thereby enhancing the robustness. Extensive experiments on reconstruction tasks
across multiple modalities demonstrate that our method achieves up to a 7.5 dB
improvement in peak signal-to-noise ratio (PSNR) values compared to original
INRs under noisy conditions.
comment: 4 pages, 7 figures
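A hedged PyTorch sketch of the idea of regulating the gradient of the
reconstruction loss with respect to the weights; the paper's exact robust loss
may be formulated differently, and the coordinate MLP and the weight `lam` are
placeholders:
```python
import torch
import torch.nn as nn

def robust_inr_loss(model: nn.Module, coords, target, lam=1e-3):
    """Reconstruction loss plus a penalty on the gradient of that loss with
    respect to the weights -- a hedged reading of the robust loss."""
    recon = ((model(coords) - target) ** 2).mean()
    grads = torch.autograd.grad(recon, list(model.parameters()), create_graph=True)
    grad_penalty = sum((g ** 2).sum() for g in grads)
    return recon + lam * grad_penalty

# illustrative usage with a tiny coordinate MLP
inr = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 3))
coords, rgb = torch.rand(1024, 2), torch.rand(1024, 3)
loss = robust_inr_loss(inr, coords, rgb)
loss.backward()
```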
☆ AIM 2025 challenge on Inverse Tone Mapping Report: Methods and Results
Chao Wang, Francesco Banterle, Bin Ren, Radu Timofte, Xin Lu, Yufeng Peng, Chengjie Ge, Zhijing Sun, Ziang Zhou, Zihao Li, Zishun Liao, Qiyu Kang, Xueyang Fu, Zheng-Jun Zha, Zhijing Sun, Xingbo Wang, Kean Liu, Senyan Xu, Yang Qiu, Yifan Ding, Gabriel Eilertsen, Jonas Unger, Zihao Wang, Ke Wu, Jinshan Pan, Zhen Liu, Zhongyang Li, Shuaicheng Liu, S. M Nadim Uddin
This paper presents a comprehensive review of the AIM 2025 Challenge on
Inverse Tone Mapping (ITM). The challenge aimed to push forward the development
of effective ITM algorithms for HDR image reconstruction from single LDR
inputs, focusing on perceptual fidelity and numerical consistency. A total of
\textbf{67} participants submitted \textbf{319} valid results, from which the
best five teams were selected for detailed analysis. This report consolidates
their methodologies and performance, with the lowest PU21-PSNR among the top
entries reaching 29.22 dB. The analysis highlights innovative strategies for
enhancing HDR reconstruction quality and establishes strong benchmarks to guide
future research in inverse tone mapping.
☆ Distribution-Aware Hadamard Quantization for Hardware-Efficient Implicit Neural Representations
Implicit Neural Representations (INRs) encode discrete signals using
Multi-Layer Perceptrons (MLPs) with complex activation functions. While INRs
achieve superior performance, they depend on full-precision number
representation for accurate computation, resulting in significant hardware
overhead. Previous INR quantization approaches have primarily focused on weight
quantization, offering only limited hardware savings due to the lack of
activation quantization. To fully exploit the hardware benefits of
quantization, we propose DHQ, a novel distribution-aware Hadamard quantization
scheme that targets both weights and activations in INRs. Our analysis shows
that the weights in the first and last layers have distributions distinct from
those in the intermediate layers, while the activations in the last layer
differ significantly from those in the preceding layers. Instead of customizing
quantizers individually, we utilize the Hadamard transformation to standardize
these diverse distributions into a unified bell-shaped form, supported by both
empirical evidence and theoretical analysis, before applying a standard
quantizer. To demonstrate the practical advantages of our approach, we present
an FPGA implementation of DHQ that highlights its hardware efficiency.
Experiments on diverse image reconstruction tasks show that DHQ outperforms
previous quantization methods, reducing latency by 32.7\%, energy consumption
by 40.1\%, and resource utilization by up to 98.3\% compared to full-precision
counterparts.
comment: 6 pages, 7 figures
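A NumPy sketch of the "Hadamard-rotate, uniformly quantize, rotate back" idea;
the placement of the transform, the bit-width, and the per-tensor min-max
quantizer here are illustrative simplifications of DHQ:
```python
import numpy as np
from scipy.linalg import hadamard

def hadamard_quantize(w: np.ndarray, bits: int = 4) -> np.ndarray:
    """Rotate the rows of a weight matrix with an orthonormal Hadamard matrix,
    apply a uniform quantizer, and rotate back. The row dimension must be a
    power of two; DHQ's transform placement and calibration may differ."""
    n = w.shape[0]
    H = hadamard(n) / np.sqrt(n)                      # orthonormal (H @ H.T = I)
    w_rot = H @ w                                     # distribution becomes more bell-shaped
    levels = 2 ** bits - 1
    lo, hi = w_rot.min(), w_rot.max()
    step = (hi - lo) / levels
    w_q = np.round((w_rot - lo) / step) * step + lo   # uniform quantization
    return H.T @ w_q                                  # inverse rotation

w = np.random.laplace(size=(64, 128))
w_hat = hadamard_quantize(w, bits=4)
print(float(np.abs(w - w_hat).mean()))
```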
☆ MINR: Efficient Implicit Neural Representations for Multi-Image Encoding
Implicit Neural Representations (INRs) aim to parameterize discrete signals
through implicit continuous functions. However, formulating each image with a
separate neural network (typically a Multi-Layer Perceptron, MLP) leads to
computational and storage inefficiencies when encoding multi-images. To address
this issue, we propose MINR, which shares specific layers to encode multiple
images efficiently. We first compare the layer-wise weight distributions for several
trained INRs and find that corresponding intermediate layers follow highly
similar distribution patterns. Motivated by this, we share these intermediate
layers across multiple images while preserving the input and output layers as
input-specific. In addition, we design an extra novel projection layer for each
image to capture its unique features. Experimental results on image
reconstruction and super-resolution tasks demonstrate that MINR can save up to
60\% parameters while maintaining comparable performance. Particularly, MINR
scales effectively to handle 100 images, maintaining an average peak
signal-to-noise ratio (PSNR) of 34 dB. Further analysis of various backbones
proves the robustness of the proposed MINR.
comment: 4 pages, 4 figures
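A small PyTorch sketch of sharing intermediate layers across images while
keeping input, projection, and output layers image-specific; layer sizes and
the projection layer's position are assumptions rather than MINR's exact
design:
```python
import torch
import torch.nn as nn

class MINRBank(nn.Module):
    """Multi-image INR with shared intermediate layers and image-specific
    input, projection, and output layers."""
    def __init__(self, num_images: int, hidden: int = 256, depth: int = 4):
        super().__init__()
        self.inputs = nn.ModuleList(nn.Linear(2, hidden) for _ in range(num_images))
        self.projections = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(num_images))
        self.shared = nn.ModuleList(nn.Linear(hidden, hidden) for _ in range(depth))
        self.outputs = nn.ModuleList(nn.Linear(hidden, 3) for _ in range(num_images))
        self.act = nn.ReLU()

    def forward(self, coords: torch.Tensor, image_id: int) -> torch.Tensor:
        h = self.act(self.inputs[image_id](coords))
        h = self.act(self.projections[image_id](h))   # captures image-specific features
        for layer in self.shared:                     # weights reused by every image
            h = self.act(layer(h))
        return self.outputs[image_id](h)

model = MINRBank(num_images=100)
rgb = model(torch.rand(4096, 2), image_id=7)
```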
☆ STER-VLM: Spatio-Temporal With Enhanced Reference Vision-Language Models ICCV
Vision-language models (VLMs) have emerged as powerful tools for enabling
automated traffic analysis; however, current approaches often demand
substantial computational resources and struggle with fine-grained
spatio-temporal understanding. This paper introduces STER-VLM, a
computationally efficient framework that enhances VLM performance through (1)
caption decomposition to tackle spatial and temporal information separately,
(2) temporal frame selection with best-view filtering for sufficient temporal
information, and (3) reference-driven understanding for capturing fine-grained
motion and dynamic context and (4) curated visual/textual prompt techniques.
Experimental results on the WTS \cite{kong2024wts} and BDD \cite{BDD} datasets
demonstrate substantial gains in semantic richness and traffic scene
interpretation. Our framework is validated by a test score of
55.655 in the AI City Challenge 2025 Track 2, showing its effectiveness in
advancing resource-efficient and accurate traffic analysis for real-world
applications.
comment: ICCV Workshop 2025
☆ Vision Transformers for Kidney Stone Image Classification: A Comparative Study with CNNs
Ivan Reyes-Amezcua, Francisco Lopez-Tiro, Clement Larose, Andres Mendez-Vazquez, Gilberto Ochoa-Ruiz, Christian Daul
Kidney stone classification from endoscopic images is critical for
personalized treatment and recurrence prevention. While convolutional neural
networks (CNNs) have shown promise in this task, their limited ability to
capture long-range dependencies can hinder performance under variable imaging
conditions. This study presents a comparative analysis between Vision
Transformers (ViTs) and CNN-based models, evaluating their performance on two
ex vivo datasets comprising CCD camera and flexible ureteroscope images. The
ViT-base model pretrained on ImageNet-21k consistently outperformed a ResNet50
baseline across multiple imaging conditions. For instance, in the most visually
complex subset (Section patches from endoscopic images), the ViT model achieved
95.2% accuracy and 95.1% F1-score, compared to 64.5% and 59.3% with ResNet50.
In the mixed-view subset from CCD-camera images, ViT reached 87.1% accuracy
versus 78.4% with CNN. These improvements extend across precision and recall as
well. The results demonstrate that ViT-based architectures provide superior
classification performance and offer a scalable alternative to conventional
CNNs for kidney stone image analysis.
☆ Revisiting MLLM Token Technology through the Lens of Classical Visual Coding
Jinming Liu, Junyan Lin, Yuntao Wei, Kele Shao, Keda Tao, Jianguo Huang, Xudong Yang, Zhibo Chen, Huan Wang, Xin Jin
Classical visual coding and Multimodal Large Language Model (MLLM) token
technology share the core objective - maximizing information fidelity while
minimizing computational cost. Therefore, this paper reexamines MLLM token
technology, including tokenization, token compression, and token reasoning,
through the established principles of the long-developed visual coding area. From
this perspective, we (1) establish a unified formulation bridging token
technology and visual coding, enabling a systematic, module-by-module
comparative analysis; (2) synthesize bidirectional insights, exploring how
visual coding principles can enhance MLLM token techniques' efficiency and
robustness, and conversely, how token technology paradigms can inform the
design of next-generation semantic visual codecs; (3) prospect for promising
future research directions and critical unsolved challenges. In summary, this
study presents the first comprehensive and structured technology comparison of
MLLM token and visual coding, paving the way for more efficient multimodal
models and more powerful visual codecs simultaneously.
☆ Hierarchy-Consistent Learning and Adaptive Loss Balancing for Hierarchical Multi-Label Classification CIKM 2025
Hierarchical Multi-Label Classification (HMC) faces critical challenges in
maintaining structural consistency and balancing loss weighting in Multi-Task
Learning (MTL). In order to address these issues, we propose a classifier
called HCAL based on MTL integrated with prototype contrastive learning and
adaptive task-weighting mechanisms. The most significant advantage of our
classifier is semantic consistency, which covers both prototypes that
explicitly model labels and feature aggregation from child classes to parent
classes.
The other important advantage is an adaptive loss-weighting mechanism that
dynamically allocates optimization resources by monitoring task-specific
convergence rates. It effectively resolves the "one-strong-many-weak"
optimization bias inherent in traditional MTL approaches. To further enhance
robustness, a prototype perturbation mechanism is formulated by injecting
controlled noise into prototypes to expand decision boundaries. Additionally,
we formalize a quantitative metric called the Hierarchical Violation Rate
(HVR) to
evaluate hierarchical consistency and generalization. Extensive experiments
across three datasets demonstrate both the higher classification accuracy and
reduced hierarchical violation rate of the proposed classifier over baseline
models.
comment: 10 pages, 7 figures, accepted by CIKM 2025
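A plausible reading of the Hierarchical Violation Rate, implemented as the
fraction of predicted labels whose parent label is not also predicted; the
paper's exact definition may differ:
```python
import numpy as np

def hierarchical_violation_rate(pred: np.ndarray, parent: dict) -> float:
    """Fraction of predicted labels whose parent label is not also predicted.
    `pred` is a binary (num_samples, num_labels) matrix and `parent` maps a
    label index to its parent index (roots omitted)."""
    violations, total = 0, 0
    for child, par in parent.items():
        active = pred[:, child] == 1
        total += int(active.sum())
        violations += int((active & (pred[:, par] == 0)).sum())
    return violations / max(total, 1)

# labels: 0 = animal, 1 = dog (child of 0), 2 = car
pred = np.array([[1, 1, 0], [0, 1, 0]])
print(hierarchical_violation_rate(pred, parent={1: 0}))   # 0.5: one "dog" without "animal"
```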
☆ EDTalk++: Full Disentanglement for Controllable Talking Head Synthesis
Achieving disentangled control over multiple facial motions and accommodating
diverse input modalities greatly enhances the application and entertainment of
the talking head generation. This necessitates a deep exploration of the
decoupling space for facial features, ensuring that they a) operate
independently without mutual interference and b) can be preserved to share with
different modal inputs, both aspects often neglected in existing methods. To
address this gap, this paper proposes EDTalk++, a novel full disentanglement
framework for controllable talking head generation. Our framework enables
individual manipulation of mouth shape, head pose, eye movement, and emotional
expression, conditioned on video or audio inputs. Specifically, we employ four
lightweight modules to decompose the facial dynamics into four distinct latent
spaces representing mouth, pose, eye, and expression, respectively. Each space
is characterized by a set of learnable bases whose linear combinations define
specific motions. To ensure independence and accelerate training, we enforce
orthogonality among bases and devise an efficient training strategy to allocate
motion responsibilities to each space without relying on external knowledge.
The learned bases are then stored in corresponding banks, enabling shared
visual priors with audio input. Furthermore, considering the properties of each
space, we propose an Audio-to-Motion module for audio-driven talking head
synthesis. Experiments are conducted to demonstrate the effectiveness of
EDTalk++.
comment: 17 pages,15 figures. arXiv admin note: substantial text overlap with
arXiv:2404.01647
♻ ☆ LoRA-Edit: Controllable First-Frame-Guided Video Editing via Mask-Aware LoRA Fine-Tuning
Video editing using diffusion models has achieved remarkable results in
generating high-quality edits for videos. However, current methods often rely
on large-scale pretraining, limiting flexibility for specific edits.
First-frame-guided editing provides control over the first frame, but lacks
flexibility over subsequent frames. To address this, we propose a mask-based
LoRA (Low-Rank Adaptation) tuning method that adapts pretrained Image-to-Video
(I2V) models for flexible video editing. Our key innovation is using a
spatiotemporal mask to strategically guide the LoRA fine-tuning process. This
teaches the model two distinct skills: first, to interpret the mask as a
command to either preserve content from the source video or generate new
content in designated regions. Second, for these generated regions, LoRA learns
to synthesize either temporally consistent motion inherited from the video or
novel appearances guided by user-provided reference frames. This
dual-capability LoRA grants users control over the edit's entire temporal
evolution, allowing complex transformations like an object rotating or a flower
blooming. Experimental results show our method achieves superior video editing
performance compared to baseline methods. Project Page:
https://cjeen.github.io/LoRAEdit
comment: 9 pages
♻ ☆ Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction ICCV 2025
We introduce Geo4D, a method to repurpose video diffusion models for
monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic
priors captured by large-scale pre-trained video models, Geo4D can be trained
using only synthetic data while generalizing well to real data in a zero-shot
manner. Geo4D predicts several complementary geometric modalities, namely
point, disparity, and ray maps. We propose a new multi-modal alignment
algorithm to align and fuse these modalities, as well as a sliding window
approach at inference time, thus enabling robust and accurate 4D reconstruction
of long videos. Extensive experiments across multiple benchmarks show that
Geo4D significantly surpasses state-of-the-art video depth estimation methods.
comment: 17 pages, 6 figures, ICCV 2025 Highlight, Project page:
https://geo4d.github.io/
♻ ☆ RadGPT: Constructing 3D Image-Text Tumor Datasets
Pedro R. A. S. Bassi, Mehmet Can Yavuz, Kang Wang, Xiaoxi Chen, Wenxuan Li, Sergio Decherchi, Andrea Cavalli, Yang Yang, Alan Yuille, Zongwei Zhou
Cancers identified in CT scans are usually accompanied by detailed radiology
reports, but publicly available CT datasets often lack these essential reports.
This absence limits their usefulness for developing accurate report generation
AI. To address this gap, we present AbdomenAtlas 3.0, the first public,
high-quality abdominal CT dataset with detailed, expert-reviewed radiology
reports. All reports are paired with per-voxel masks and they describe liver,
kidney and pancreatic tumors. AbdomenAtlas 3.0 has 9,262 triplets of CT, mask
and report--3,955 with tumors. These CT scans come from 17 public datasets.
Besides creating the reports for these datasets, we expanded their number of
tumor masks by 4.2x, identifying 3,011 new tumor cases. Notably, the reports in
AbdomenAtlas 3.0 are more standardized, and generated faster than traditional
human-made reports. They provide details like tumor size, location, attenuation
and surgical resectability. These reports were created by 12 board-certified
radiologists using our proposed RadGPT, a novel framework that converted
radiologist-revised tumor segmentation masks into structured and narrative
reports. Besides being a dataset creation tool, RadGPT can also become a
fully-automatic, segmentation-assisted report generation method. We benchmarked
this method and 5 state-of-the-art report generation vision-language models.
Our results show that segmentation strongly improves tumor detection in AI-made
reports.
♻ ☆ DNF-Avatar: Distilling Neural Fields for Real-time Animatable Avatar Relighting ICCV 2025
Creating relightable and animatable human avatars from monocular videos is a
rising research topic with a range of applications, e.g. virtual reality,
sports, and video games. Previous works utilize neural fields together with
physically based rendering (PBR), to estimate geometry and disentangle
appearance properties of human avatars. However, one drawback of these methods
is the slow rendering speed due to the expensive Monte Carlo ray tracing. To
tackle this problem, we propose to distill the knowledge from implicit neural
fields (teacher) to explicit 2D Gaussian splatting (student) representation to
take advantage of the fast rasterization property of Gaussian splatting. To
avoid ray-tracing, we employ the split-sum approximation for PBR appearance. We
also propose novel part-wise ambient occlusion probes for shadow computation.
Shadow prediction is achieved by querying these probes only once per pixel,
which paves the way for real-time relighting of avatars. These techniques
combined give high-quality relighting results with realistic shadow effects.
Our experiments demonstrate that the proposed student model achieves comparable
or even better relighting results with our teacher model while being 370 times
faster at inference time, achieving a 67 FPS rendering speed.
comment: 17 pages, 9 figures, ICCV 2025 Findings Oral, Project pages:
https://jzr99.github.io/DNF-Avatar/
♻ ☆ Assessment of Using Synthetic Data in Brain Tumor Segmentation
Manual brain tumor segmentation from MRI scans is challenging due to tumor
heterogeneity, scarcity of annotated data, and class imbalance in medical
imaging datasets. Synthetic data generated by generative models has the
potential to mitigate these issues by improving dataset diversity. This study
investigates, as a proof of concept, the impact of incorporating synthetic MRI
data, generated using a pre-trained GAN model, into training a U-Net
segmentation network. Experiments were conducted using real data from the BraTS
2020 dataset, synthetic data generated with the medigan library, and hybrid
datasets combining real and synthetic samples in varying proportions. While
overall quantitative performance (Dice coefficient, IoU, precision, recall,
accuracy) was comparable between real-only and hybrid-trained models,
qualitative inspection suggested that hybrid datasets, particularly with 40%
real and 60% synthetic data, improved whole tumor boundary delineation.
However, region-wise accuracy for the tumor core and the enhancing tumor
remained lower, indicating a persistent class imbalance. The findings support
the feasibility of synthetic data as an augmentation strategy for brain tumor
segmentation, while highlighting the need for larger-scale experiments,
volumetric data consistency, and mitigating class imbalance in future work.
comment: Updates include improved references, clearer table column title, and
minor language corrections
♻ ☆ Slot Attention with Re-Initialization and Self-Distillation ACM MM 2025
Unlike popular solutions based on dense feature maps, Object-Centric Learning
(OCL) represents visual scenes as sub-symbolic object-level feature vectors,
termed slots, which are highly versatile for tasks involving visual modalities.
OCL typically aggregates object superpixels into slots by iteratively applying
competitive cross attention, known as Slot Attention, with the slots as the
query. However, once initialized, these slots are reused naively, causing
redundant slots to compete with informative ones for representing objects. This
often results in objects being erroneously segmented into parts. Additionally,
mainstream methods derive supervision signals solely from decoding slots into
the input's reconstruction, overlooking potential supervision based on internal
information. To address these issues, we propose Slot Attention with
re-Initialization and self-Distillation (DIAS): $\emph{i)}$ We reduce
redundancy in the aggregated slots and re-initialize extra aggregation to
update the remaining slots; $\emph{ii)}$ We drive the poor attention map at
the first aggregation iteration to approximate the good one at the last
iteration to
enable self-distillation. Experiments demonstrate that DIAS achieves
state-of-the-art on OCL tasks like object discovery and recognition, while also
improving advanced visual prediction and reasoning. Our source code and model
checkpoints are available on https://github.com/Genera1Z/DIAS.
comment: Accepted by ACM MM 2025
♻ ☆ AutoComPose: Automatic Generation of Pose Transition Descriptions for Composed Pose Retrieval Using Multimodal LLMs ICCV 2025
Yi-Ting Shen, Sungmin Eum, Doheon Lee, Rohit Shete, Chiao-Yi Wang, Heesung Kwon, Shuvra S. Bhattacharyya
Composed pose retrieval (CPR) enables users to search for human poses by
specifying a reference pose and a transition description, but progress in this
field is hindered by the scarcity and inconsistency of annotated pose
transitions. Existing CPR datasets rely on costly human annotations or
heuristic-based rule generation, both of which limit scalability and diversity.
In this work, we introduce AutoComPose, the first framework that leverages
multimodal large language models (MLLMs) to automatically generate rich and
structured pose transition descriptions. Our method enhances annotation quality
by structuring transitions into fine-grained body part movements and
introducing mirrored/swapped variations, while a cyclic consistency constraint
ensures logical coherence between forward and reverse transitions. To advance
CPR research, we construct and release two dedicated benchmarks, AIST-CPR and
PoseFixCPR, supplementing prior datasets with enhanced attributes. Extensive
experiments demonstrate that training retrieval models with AutoComPose yields
superior performance over human-annotated and heuristic-based methods,
significantly reducing annotation costs while improving retrieval quality. Our
work pioneers the automatic annotation of pose transitions, establishing a
scalable foundation for future CPR research.
comment: ICCV 2025
♻ ☆ Fully Automated Segmentation of Fiber Bundles in Anatomic Tracing Data MICCAI 2025
Kyriaki-Margarita Bintsi, Yaël Balbastre, Jingjing Wu, Julia F. Lehman, Suzanne N. Haber, Anastasia Yendiki
Anatomic tracer studies are critical for validating and improving diffusion
MRI (dMRI) tractography. However, large-scale analysis of data from such
studies is hampered by the labor-intensive process of annotating fiber bundles
manually on histological slides. Existing automated methods often miss sparse
bundles or require complex post-processing across consecutive sections,
limiting their flexibility and generalizability. We present a streamlined,
fully automated framework for fiber bundle segmentation in macaque tracer data,
based on a U-Net architecture with large patch sizes, foreground-aware
sampling, and semi-supervised pre-training. Our approach eliminates common
errors such as mislabeling terminals as bundles, improves detection of sparse
bundles by over 20% and reduces the False Discovery Rate (FDR) by 40% compared
to the state-of-the-art, all while enabling analysis of standalone slices. This
new framework will facilitate the automated analysis of anatomic tracing data
at a large scale, generating more ground-truth data that can be used to
validate and optimize dMRI tractography methods.
comment: Accepted at CDMRI, MICCAI 2025
♻ ☆ HouseCrafter: Lifting Floorplans to 3D Scenes with 2D Diffusion Model
We introduce HouseCrafter, a novel approach that can lift a floorplan into a
complete large 3D indoor scene (e.g., a house). Our key insight is to adapt a
2D diffusion model, which is trained on web-scale images, to generate
consistent multi-view color (RGB) and depth (D) images across different
locations of the scene. Specifically, the RGB-D images are generated
autoregressively in a batch-wise manner along sampled locations based on the
floorplan, where previously generated images are used as conditions for the
diffusion model to produce images at nearby locations. The global floorplan and
attention design in the diffusion model ensures the consistency of the
generated images, from which a 3D scene can be reconstructed. Through extensive
evaluation on the 3D-Front dataset, we demonstrate that HouseCrafter can generate
high-quality house-scale 3D scenes. Ablation studies also validate the
effectiveness of different design choices. We will release our code and model
weights. Project page: https://neu-vi.github.io/houseCrafter/
♻ ☆ UltraDfeGAN: Detail-Enhancing Generative Adversarial Networks for High-Fidelity Functional Ultrasound Synthesis
Functional ultrasound (fUS) is a neuroimaging technique known for its high
spatiotemporal resolution, enabling non-invasive observation of brain activity
through neurovascular coupling. Despite its potential in clinical applications
such as neonatal monitoring and intraoperative guidance, the development of fUS
faces challenges related to data scarcity and limitations in generating
realistic fUS images. This paper explores the use of a generative adversarial
network (GAN) framework tailored for fUS image synthesis. The proposed method
incorporates architectural enhancements, including feature enhancement modules
and normalization techniques, aiming to improve the fidelity and physiological
plausibility of generated images. The study evaluates the performance of the
framework against existing generative models, demonstrating its capability to
produce high-quality fUS images under various experimental conditions.
Additionally, the synthesized images are assessed for their utility in
downstream tasks, showing improvements in classification accuracy when used for
data augmentation. Experimental results are based on publicly available fUS
datasets, highlighting the framework's effectiveness in addressing data
limitations.
♻ ☆ Vision Backbone Efficient Selection for Image Classification in Low-Data Regimes BMVC 2025
Transfer learning has become an essential tool in modern computer vision,
allowing practitioners to leverage backbones, pretrained on large datasets, to
train successful models from limited annotated data. Choosing the right
backbone is crucial, especially for small datasets, since final performance
depends heavily on the quality of the initial feature representations. While
prior work has conducted benchmarks across various datasets to identify
universal top-performing backbones, we demonstrate that backbone effectiveness
is highly dataset-dependent, especially in low-data scenarios where no single
backbone consistently excels. To overcome this limitation, we introduce
dataset-specific backbone selection as a new research direction and investigate
its practical viability in low-data regimes. Since exhaustive evaluation is
computationally impractical for large backbone pools, we formalize Vision
Backbone Efficient Selection (VIBES) as the problem of searching for
high-performing backbones under computational constraints. We define the
solution space, propose several heuristics, and demonstrate VIBES feasibility
for low-data image classification by performing experiments on four diverse
datasets. Our results show that even simple search strategies can find
well-suited backbones within a pool of over $1300$ pretrained models,
outperforming generic benchmark recommendations within just ten minutes of
search time on a single GPU (NVIDIA RTX A5000).
comment: 16 pages, 8 figures, Accepted at BMVC 2025
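A minimal sketch of one simple search strategy in the VIBES spirit: score
candidate backbones with a cheap cross-validated linear probe on frozen
features until a time budget runs out. `extract_features` is a hypothetical
helper, and the paper's heuristics are more elaborate:
```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_backbone(candidates, extract_features, X, y, budget_s=600):
    """Score each candidate backbone with a linear probe on frozen features
    and keep the best one found before the time budget is exhausted.
    `extract_features(name, X)` returns an (n_samples, feature_dim) array."""
    start, best = time.time(), (None, -np.inf)
    for name in candidates:
        if time.time() - start > budget_s:
            break
        feats = extract_features(name, X)
        score = cross_val_score(LogisticRegression(max_iter=1000), feats, y, cv=3).mean()
        if score > best[1]:
            best = (name, score)
    return best   # (backbone name, probe accuracy)
```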
♻ ☆ Blending 3D Geometry and Machine Learning for Multi-View Stereopsis
Traditional multi-view stereo (MVS) methods primarily depend on photometric
and geometric consistency constraints. In contrast, modern learning-based
algorithms often rely on the plane sweep algorithm to infer 3D geometry,
applying explicit geometric consistency (GC) checks only as a post-processing
step, with no impact on the learning process itself. In this work, we introduce
GC-MVSNet++, a novel approach that actively enforces geometric
consistency of reference view depth maps across multiple source views (multi
view) and at various scales (multi scale) during the learning phase (see Fig.
1). This integrated GC check significantly accelerates the learning process by
directly penalizing geometrically inconsistent pixels, effectively halving the
number of training iterations compared to other MVS methods. Furthermore, we
introduce a densely connected cost regularization network with two distinct
block designs, simple and feature-dense, optimized to harness dense feature
connections for enhanced regularization. Extensive experiments demonstrate that
our approach achieves a new state of the art on the DTU and BlendedMVS datasets
and secures second place on the Tanks and Temples benchmark. To our knowledge,
GC-MVSNet++ is the first method to enforce multi-view, multi-scale
supervised geometric consistency during learning. Our code is available.
comment: A pre-print -- accepted at Neurocomputing. arXiv admin note:
substantial text overlap with arXiv:2310.19583
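A NumPy sketch of the forward-backward reprojection check that underlies this
kind of geometric consistency supervision; shared intrinsics,
nearest-neighbour depth sampling, strictly positive depths, and the 1 px / 1%
thresholds are common MVS conventions rather than the paper's exact settings:
```python
import numpy as np

def project(pts, K):
    """Project 3D points (3, N) in a camera frame to pixels (2, N) and depths (N,)."""
    z = pts[2]
    return (K @ (pts / z))[:2], z

def backproject(uv, depth, K):
    """Lift pixels (2, N) with depths (N,) to 3D points (3, N) in the camera frame."""
    pix = np.vstack([uv, np.ones_like(depth)])
    return (np.linalg.inv(K) @ pix) * depth

def geometric_consistency_mask(d_ref, d_src, K, T_ref2src,
                               pix_thresh=1.0, rel_depth_thresh=0.01):
    """Forward-backward reprojection check between reference and source depth maps."""
    h, w = d_ref.shape
    v, u = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    uv_ref = np.stack([u.ravel(), v.ravel()]).astype(np.float64)      # (2, HW)

    # reference pixels -> source view
    pts_src = T_ref2src[:3, :3] @ backproject(uv_ref, d_ref.ravel(), K) + T_ref2src[:3, 3:4]
    uv_src, _ = project(pts_src, K)

    # nearest-neighbour sample of the source depth map
    us, vs = np.round(uv_src).astype(int)
    inside = (us >= 0) & (us < w) & (vs >= 0) & (vs < h)
    d_sampled = d_src[np.clip(vs, 0, h - 1), np.clip(us, 0, w - 1)]

    # source pixels -> back into the reference view
    T_src2ref = np.linalg.inv(T_ref2src)
    pts_back = T_src2ref[:3, :3] @ backproject(uv_src, d_sampled, K) + T_src2ref[:3, 3:4]
    uv_back, d_back = project(pts_back, K)

    pix_err = np.linalg.norm(uv_back - uv_ref, axis=0)
    rel_err = np.abs(d_back - d_ref.ravel()) / d_ref.ravel()
    mask = inside & (pix_err < pix_thresh) & (rel_err < rel_depth_thresh)
    return mask.reshape(h, w)
```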
♻ ☆ Vector-Quantized Vision Foundation Models for Object-Centric Learning ACM MM 2025
Object-Centric Learning (OCL) aggregates image or video feature maps into
object-level feature vectors, termed \textit{slots}. Its self-supervision of
reconstructing the input from slots struggles with complex object textures,
thus Vision Foundation Model (VFM) representations are used as the aggregation
input and reconstruction target. Existing methods leverage VFM representations
in diverse ways yet fail to fully exploit their potential. In response, we
propose a unified architecture, Vector-Quantized VFMs for OCL (VQ-VFM-OCL, or
VVO). The key to our unification is simply shared quantizing VFM
representations in OCL aggregation and decoding. Experiments show that across
different VFMs, aggregators and decoders, our VVO consistently outperforms
baselines in object discovery and recognition, as well as downstream visual
prediction and reasoning. We also mathematically analyze why VFM
representations facilitate OCL aggregation and why their shared quantization as
reconstruction targets strengthens OCL supervision. Our source code and model
checkpoints are available on https://github.com/Genera1Z/VQ-VFM-OCL.
comment: Accepted by ACM MM 2025
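A compact PyTorch sketch of the shared quantization step: nearest-neighbour
codebook lookup over VFM tokens with a straight-through gradient and the usual
codebook/commitment losses. Codebook size, token dimension, and the 0.25
commitment weight are common defaults, not necessarily VVO's:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=1024, dim=768, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z):                                  # z: (B, N, dim) VFM tokens
        cb = self.codebook.weight                          # (num_codes, dim)
        d = (z.pow(2).sum(-1, keepdim=True)                # squared L2 distances
             - 2 * z @ cb.t() + cb.pow(2).sum(-1))
        idx = d.argmin(dim=-1)
        z_q = self.codebook(idx)
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                       # straight-through estimator
        return z_q, idx, loss

vq = VectorQuantizer()
tokens = torch.randn(2, 196, 768)                          # e.g. ViT/DINOv2 patch tokens
quantized, indices, vq_loss = vq(tokens)
```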
♻ ☆ Vehicle detection from GSV imagery: Predicting travel behaviour for cycling and motorcycling using Computer Vision
Kyriaki Kokka, Rahul Goel, Ali Abbas, Kerry A. Nice, Luca Martial, SM Labib, Rihuan Ke, Carola Bibiane Schönlieb, James Woodcock
Transportation influences health by shaping exposure to physical activity, air
pollution and injury risk. Comparative data on cycling and motorcycling
behaviours is scarce, particularly at a global scale. Street view imagery, such
as Google Street View (GSV), combined with computer vision, is a valuable
resource for efficiently capturing travel behaviour data. This study
demonstrates a novel approach using deep learning on street view images to
estimate cycling and motorcycling levels across diverse cities worldwide. We
utilized data from 185 global cities, with mode shares of cycling and
motorcycling estimated from travel surveys or censuses. We used GSV images to
detect cycles and motorcycles in sampled locations, using 8000 images per city.
The YOLOv4 model, fine-tuned using images from six cities, achieved a mean
average precision of 89% for detecting cycles and motorcycles. A global
prediction model was developed using beta regression with city-level mode
shares as outcome, with log transformed explanatory variables of counts of
GSV-detected images with cycles and motorcycles, while controlling for
population density. We found strong correlations between GSV motorcycle counts
and motorcycle mode share (0.78) and moderate correlations between GSV cycle
counts and cycling mode share (0.51). Beta regression models predicted mode
shares with $R^2$ values of 0.614 for cycling and 0.612 for motorcycling,
achieving median absolute errors (MDAE) of 1.3% and 1.4%, respectively.
Scatterplots demonstrated consistent prediction accuracy, though cities like
Utrecht and Cali were outliers. The model was applied to 60 cities globally for
which we didn't have recent mode share data. We provided estimates for some
cities in the Middle East, Latin America and East Asia. With computer vision,
GSV images capture travel modes and activity, providing insights alongside
traditional data sources.
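A minimal SciPy sketch of a beta regression with a logit link for the mean and
a single precision parameter, the kind of model described above; the toy
covariate stands in for a log-transformed GSV detection count, and the fitting
details are illustrative rather than the paper's exact specification:
```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import beta as beta_dist

def fit_beta_regression(X, y):
    """Maximum-likelihood beta regression with a logit link for the mean.
    Returns [intercept, coefficients..., log(precision)]."""
    X = np.column_stack([np.ones(len(y)), X])      # add intercept

    def nll(params):
        coefs, log_phi = params[:-1], params[-1]
        mu, phi = expit(X @ coefs), np.exp(log_phi)
        return -beta_dist.logpdf(y, mu * phi, (1 - mu) * phi).sum()

    return minimize(nll, np.zeros(X.shape[1] + 1), method="Nelder-Mead").x

# toy usage: mode share (in (0, 1)) against a log-transformed detection count
rng = np.random.default_rng(0)
log_counts = rng.normal(size=(100, 1))
share = expit(-1.5 + 0.8 * log_counts[:, 0]) * 0.3 + rng.uniform(0.01, 0.02, 100)
print(fit_beta_regression(log_counts, np.clip(share, 1e-3, 1 - 1e-3)))
```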
♻ ☆ MMHMER:Multi-viewer and Multi-task for Handwritten Mathematical Expression Recognition
Handwritten Mathematical Expression Recognition (HMER) methods have made
remarkable progress, with most existing HMER approaches based on either hybrid
CNN/RNN architectures with GRUs or Transformer architectures. Each
of these has its strengths and weaknesses. Leveraging different model
structures as viewers and effectively integrating their diverse capabilities
presents an intriguing avenue for exploration. This involves addressing two key
challenges: 1) How to fuse these two methods effectively, and 2) How to achieve
higher performance under an appropriate level of complexity. This paper
proposes an efficient CNN-Transformer multi-viewer, multi-task approach to
enhance the model's recognition performance. Our MMHMER model achieves 63.96%,
62.51%, and 65.46% ExpRate on CROHME14, CROHME16, and CROHME19, outperforming
Posformer with an absolute gain of 1.28%, 1.48%, and 0.58%. The main
contribution of our approach is that we propose a new multi-view, multi-task
framework that can effectively integrate the strengths of CNN and Transformer.
By leveraging the feature extraction capabilities of CNN and the sequence
modeling capabilities of Transformer, our model can better handle the
complexity of handwritten mathematical expressions.
comment: 7 pages;2 figures
♻ ☆ SlotMatch: Distilling Temporally Consistent Object-Centric Representations for Unsupervised Video Segmentation
Unsupervised video segmentation is a challenging computer vision task,
especially due to the lack of supervisory signals coupled with the complexity
of visual scenes. To overcome this challenge, state-of-the-art models based on
slot attention often have to rely on large and computationally expensive neural
architectures. To this end, we propose a simple knowledge distillation
framework that effectively transfers object-centric representations to a
lightweight student. The proposed framework, called SlotMatch, aligns
corresponding teacher and student slots via the cosine similarity, requiring no
additional distillation objectives or auxiliary supervision. The simplicity of
SlotMatch is confirmed via theoretical and empirical evidence, both indicating
that integrating additional losses is redundant. We conduct experiments on two
datasets to compare the state-of-the-art teacher model, SlotContrast, with our
distilled student. The results show that our student based on SlotMatch matches
and even outperforms its teacher, while using 3.6x fewer parameters and running
1.9x faster. Moreover, our student surpasses previous unsupervised video
segmentation models.
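A short PyTorch sketch of aligning student and teacher slots via cosine
similarity; Hungarian matching is assumed here for pairing corresponding
slots, which may not be how SlotMatch pairs them:
```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def slot_distillation_loss(student_slots, teacher_slots):
    """Match student slots to teacher slots and minimize 1 - cosine similarity.
    Shapes: (B, num_slots, dim)."""
    losses = []
    for s, t in zip(student_slots, teacher_slots):
        sim = F.normalize(s, dim=-1) @ F.normalize(t, dim=-1).t()   # (S, S) cosine matrix
        row, col = linear_sum_assignment(-sim.detach().cpu().numpy())
        matched = sim[torch.as_tensor(row), torch.as_tensor(col)]
        losses.append((1.0 - matched).mean())
    return torch.stack(losses).mean()

student = torch.randn(2, 7, 64, requires_grad=True)
teacher = torch.randn(2, 7, 64)
loss = slot_distillation_loss(student, teacher.detach())
loss.backward()
```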
♻ ☆ LEGO: Learning and Graph-Optimized Modular Tracker for Online Multi-Object Tracking with Point Clouds
Online multi-object tracking (MOT) plays a pivotal role in autonomous
systems. The state-of-the-art approaches usually employ a tracking-by-detection
method, and data association plays a critical role. This paper proposes a
learning and graph-optimized (LEGO) modular tracker to improve data association
performance in the existing literature. The proposed LEGO tracker integrates
graph optimization and self-attention mechanisms, which efficiently formulate
the association score map, facilitating the accurate and efficient matching of
objects across time frames. To further enhance the state update process, the
Kalman filter is added to ensure consistent tracking by incorporating temporal
coherence in the object states. Our proposed method utilizing LiDAR alone has
shown exceptional performance compared to other online tracking approaches,
including LiDAR-based and LiDAR-camera fusion-based methods. LEGO ranked 1st at
the time of submitting results to KITTI object tracking evaluation ranking
board and remains 2nd at the time of submitting this paper, among all online
trackers in the KITTI MOT benchmark for cars1
♻ ☆ Disentangled Representation Learning with the Gromov-Monge Gap ICLR 2025
Learning disentangled representations from unlabelled data is a fundamental
challenge in machine learning. Solving it may unlock other problems, such as
generalization, interpretability, or fairness. Although remarkably challenging
to solve in theory, disentanglement is often achieved in practice through prior
matching. Furthermore, recent works have shown that prior matching approaches
can be enhanced by leveraging geometrical considerations, e.g., by learning
representations that preserve geometric features of the data, such as distances
or angles between points. However, matching the prior while preserving
geometric features is challenging, as a mapping that fully preserves these
features while aligning the data distribution with the prior does not exist in
general. To address these challenges, we introduce a novel approach to
disentangled representation learning based on quadratic optimal transport. We
formulate the problem using Gromov-Monge maps that transport one distribution
onto another with minimal distortion of predefined geometric features,
preserving them as much as can be achieved. To compute such maps, we propose
the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a
reference distribution with minimal geometry distortion. We demonstrate the
effectiveness of our approach for disentanglement across four standard
benchmarks, outperforming other methods leveraging geometric considerations.
comment: ICLR 2025
♻ ☆ Rethinking Weight-Averaged Model-merging
Model merging, particularly through weight averaging, has shown surprising
effectiveness in saving computations and improving model performance without
any additional training. However, the interpretability of why and how this
technique works remains unclear. In this work, we reinterpret weight-averaged
model merging through the lens of interpretability and provide empirical
insights into the underlying mechanisms that govern its behavior. We approach
the problem from three perspectives: (1) we analyze the learned weight
structures and demonstrate that model weights encode structured representations
that help explain the compatibility of weight averaging; (2) we compare
averaging in weight space and feature space across diverse model architectures
(CNNs and ViTs) and datasets, aiming to expose under which circumstances what
combination paradigm will work more effectively; (3) we study the effect of
parameter scaling on prediction stability, highlighting how weight averaging
acts as a form of regularization that contributes to robustness. By framing
these analyses in an interpretability context, our work contributes to a more
transparent and systematic understanding of model merging for stakeholders
interested in the safety and reliability of untrained model combination
methods. The code is available at https://github.com/billhhh/Rethink-Merge.
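A minimal sketch of the weight-space merging operation analyzed above, i.e., a
uniform (or weighted) average of state dicts from architecturally identical
models:
```python
import torch

def average_state_dicts(state_dicts, weights=None):
    """Average state dicts from architecturally identical models. Integer
    buffers (e.g. BatchNorm counters) are averaged as floats here and may
    need special handling in practice."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    return {k: sum(w * sd[k].float() for w, sd in zip(weights, state_dicts))
            for k in state_dicts[0]}

# usage (illustrative):
# merged = average_state_dicts([model_a.state_dict(), model_b.state_dict()])
# model_c.load_state_dict(merged)
```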
♻ ☆ SBP-YOLO:A Lightweight Real-Time Model for Detecting Speed Bumps and Potholes
Reliable and real-time detection of road speed bumps and potholes is crucial
for anticipatory perception in advanced suspension systems, enabling timely and
adaptive damping control. Achieving high accuracy and efficiency on embedded
platforms remains challenging due to limited computational resources and the
small scale of distant targets. This paper presents SBP-YOLO, a lightweight and
high-speed detection framework tailored for bump and pothole recognition. Based
on YOLOv11n, the model integrates GhostConv and VoVGSCSPC modules into the
backbone and neck to reduce computation while enhancing multi-scale semantic
features. To improve small-object detection, a P2-level branch is introduced
with a lightweight and efficient detection head (LEDH), mitigating the added
computational overhead without compromising accuracy. A hybrid training
strategy combining NWD loss, backbone-level knowledge distillation, and
Albumentations-driven augmentation further enhances localization precision and
robustness. Experiments show that SBP-YOLO achieves 87.0 percent mAP,
outperforming the YOLOv11n baseline by 5.8 percent. After TensorRT FP16
quantization, it runs at 139.5 FPS on Jetson AGX Xavier, delivering a 12.4
percent speedup over the P2-enhanced YOLOv11. These results validate the
effectiveness of the proposed method for fast and low-latency road condition
perception in embedded suspension control systems.
comment: 14 pages, 10 figures
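A hedged PyTorch sketch of an NWD-style loss term for boxes in (cx, cy, w, h)
format, following the commonly used normalized Gaussian Wasserstein distance
formulation for tiny objects; the constant c is dataset-dependent and
illustrative, and SBP-YOLO combines such a term with its other detection
losses:
```python
import torch

def nwd_loss(pred, target, c: float = 12.8):
    """Normalized Wasserstein Distance loss between boxes in (cx, cy, w, h) format."""
    p = torch.stack([pred[..., 0], pred[..., 1], pred[..., 2] / 2, pred[..., 3] / 2], dim=-1)
    t = torch.stack([target[..., 0], target[..., 1], target[..., 2] / 2, target[..., 3] / 2], dim=-1)
    w2 = ((p - t) ** 2).sum(dim=-1)                     # squared 2-Wasserstein distance
    nwd = torch.exp(-torch.sqrt(w2 + 1e-7) / c)
    return (1.0 - nwd).mean()

pred_box = torch.tensor([[10.0, 10.0, 4.0, 6.0]])
gt_box = torch.tensor([[11.0, 10.5, 4.5, 5.0]])
print(nwd_loss(pred_box, gt_box))
```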
♻ ☆ EmoSEM: Segment and Explain Emotion Stimuli in Visual Art
This paper focuses on a key challenge in visual emotion understanding: given
an art image, the model pinpoints pixel regions that trigger a specific human
emotion, and generates linguistic explanations for it. Despite advances in
general segmentation, pixel-level emotion understanding still faces a dual
challenge: first, the subjectivity of emotion limits the ability of general
segmentation models like SAM to adapt to emotion-oriented segmentation tasks;
and second,
the abstract nature of art expression makes it hard for captioning models to
balance pixel-level semantics and emotion reasoning. To solve the above
problems, this paper proposes the Emotion stimuli Segmentation and Explanation
Model (EmoSEM) to endow the segmentation framework with emotion comprehension
capability. First, to enable the model to perform segmentation well under the
guidance of emotional intent, we introduce an emotional prompt
with a learnable mask token as the conditional input for segmentation decoding.
Then, we design an emotion projector to establish the association between
emotion and visual features. Next, more importantly, to address emotion-visual
stimuli alignment, we develop a lightweight prefix adapter, a module that fuses
the learned emotional mask with the corresponding emotion into a unified
representation compatible with the language model. Finally, we input the joint
visual, mask, and emotional tokens into the language model and output the
emotional explanations. It ensures that the generated interpretations remain
semantically and emotionally coherent with the visual stimuli. Our method
realizes end-to-end modeling from low-level pixel features to high-level
emotion interpretation, delivering the first interpretable fine-grained
framework for visual emotion analysis. Extensive experiments validate the
effectiveness of our model. Code will be made publicly available.
♻ ☆ BRISC: Annotated Dataset for Brain Tumor Segmentation and Classification with Swin-HAFNet
Accurate segmentation and classification of brain tumors from Magnetic
Resonance Imaging (MRI) remain key challenges in medical image analysis. This
is primarily due to the lack of high-quality, balanced, and diverse datasets.
In this work, we present a newly developed MRI dataset named BRISC designed
specifically for brain tumor segmentation and classification tasks. The dataset
comprises 6,000 contrast-enhanced T1-weighted MRI scans annotated by certified
radiologists and physicians. It includes three major tumor types, namely
glioma, meningioma, and pituitary, as well as non-tumorous cases. Each sample
includes high-resolution labels and is categorized across axial, sagittal, and
coronal imaging planes to facilitate robust model development and cross-view
generalization. To demonstrate the utility of the dataset, we propose a
transformer-based model designed for both segmentation and classification of
brain tumors, leveraging multi-scale feature representations from a Swin
Transformer backbone, and benchmark it against established baselines across
four diagnostic categories: glioma, meningioma, pituitary, and non-tumorous
cases. The proposed model demonstrates superior performance in both
segmentation and classification. For the
segmentation task, the method achieves the highest weighted mean
Intersection-over-Union (IoU) of 82.3\%, with improvements observed across all
tumor categories. For the classification task, the model attains an accuracy of
99.63\%, effectively distinguishing between glioma, meningioma, pituitary, and
non-tumorous cases. https://www.kaggle.com/datasets/briscdataset/brisc2025/
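Since the entry reports a weighted mean IoU of 82.3%, the sketch below shows one common way to compute a class-frequency-weighted mean IoU from integer label maps; the exact weighting used by the authors is not specified here, so ground-truth pixel-count weighting is an assumption.

```python
import numpy as np

def weighted_mean_iou(pred, gt, num_classes):
    """Frequency-weighted mean IoU over integer label maps (the weighting by
    ground-truth pixel count is an assumption about the metric used)."""
    ious, weights = [], []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue                      # class absent in both maps
        ious.append(np.logical_and(p, g).sum() / union)
        weights.append(g.sum())           # weight by ground-truth pixel count
    weights = np.asarray(weights, dtype=float)
    return float(np.average(ious, weights=weights / weights.sum()))

# toy example with 3 classes on a 4x4 label map
gt = np.array([[0, 0, 1, 1], [0, 0, 1, 1], [2, 2, 0, 0], [2, 2, 0, 0]])
pred = gt.copy(); pred[0, 2] = 0          # one mislabeled pixel
print(weighted_mean_iou(pred, gt, num_classes=3))
```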
♻ ☆ MA-CBP: A Criminal Behavior Prediction Framework Based on Multi-Agent Asynchronous Collaboration
Cheng Liu, Daou Zhang, Tingxu Liu, Yuhan Wang, Jinyang Chen, Yuexuan Li, Xinying Xiao, Chenbo Xin, Ziru Wang, Weichao Wu
With the acceleration of urbanization, criminal behavior in public scenes
poses an increasingly serious threat to social security. Traditional anomaly
detection methods based on feature recognition struggle to capture high-level
behavioral semantics from historical information, while generative approaches
based on Large Language Models (LLMs) often fail to meet real-time
requirements. To address these challenges, we propose MA-CBP, a criminal
behavior prediction framework based on multi-agent asynchronous collaboration.
This framework transforms real-time video streams into frame-level semantic
descriptions, constructs causally consistent historical summaries, and fuses
adjacent image frames to perform joint reasoning over long- and short-term
contexts. The resulting behavioral decisions include key elements such as event
subjects, locations, and causes, enabling early warning of potential criminal
activity. In addition, we construct a high-quality criminal behavior dataset
that provides multi-scale language supervision, including frame-level,
summary-level, and event-level semantic annotations. Experimental results
demonstrate that our method achieves superior performance on multiple datasets
and offers a promising solution for risk warning in urban public safety
scenarios.
♻ ☆ Enhancing Cost Efficiency in Active Learning with Candidate Set Query
This paper introduces a cost-efficient active learning (AL) framework for
classification, featuring a novel query design called candidate set query.
Unlike traditional AL queries requiring the oracle to examine all possible
classes, our method narrows down the set of candidate classes likely to include
the ground-truth class, significantly reducing the search space and labeling
cost. Moreover, we leverage conformal prediction to dynamically generate small
yet reliable candidate sets, adapting to model enhancement over successive AL
rounds. To this end, we introduce an acquisition function designed to
prioritize data points that offer high information gain at lower cost.
Empirical evaluations on CIFAR-10, CIFAR-100, and ImageNet64x64 demonstrate the
effectiveness and scalability of our framework. Notably, it reduces labeling
cost by 48% on ImageNet64x64. The project page can be found at
https://yehogwon.github.io/csq-al.
comment: Accepted to TMLR
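As a rough illustration of how conformal prediction can yield small candidate sets for the oracle, the sketch below follows the standard split-conformal recipe; the paper's acquisition function and exact set construction are not reproduced, and the coverage level alpha, the toy model, and the calibration split are assumptions.

```python
import numpy as np

def conformal_threshold(cal_probs, cal_labels, alpha=0.1):
    """Split conformal calibration: returns a threshold such that the candidate
    set {c : p(c|x) >= 1 - qhat} covers the true class with probability
    >= 1 - alpha (marginally)."""
    n = len(cal_labels)
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]      # nonconformity scores
    q = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(q, 1.0), method="higher")

def candidate_set(probs, qhat):
    """Candidate classes shown to the oracle instead of the full label space."""
    return np.where(probs >= 1.0 - qhat)[0]

# toy usage with a 5-class problem and synthetic calibration scores
rng = np.random.default_rng(0)
cal_probs = rng.dirichlet(np.ones(5), size=200)
cal_labels = rng.integers(0, 5, size=200)
qhat = conformal_threshold(cal_probs, cal_labels, alpha=0.1)
print(candidate_set(np.array([0.02, 0.55, 0.30, 0.08, 0.05]), qhat))
```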
♻ ☆ RAPNet: A Receptive-Field Adaptive Convolutional Neural Network for Pansharpening
Pansharpening refers to the process of integrating a high resolution
panchromatic (PAN) image with a lower resolution multispectral (MS) image to
generate a fused product, which is pivotal in remote sensing. Despite the
effectiveness of CNNs in addressing this challenge, they are inherently
constrained by the uniform application of convolutional kernels across all
spatial positions, overlooking local content variations. To overcome this
issue, we introduce RAPNet, a new architecture that leverages content-adaptive
convolution. At its core, RAPNet employs the Receptive-field Adaptive
Pansharpening Convolution (RAPConv), designed to produce spatially adaptive
kernels responsive to local feature context, thereby enhancing the precision of
spatial detail extraction. Additionally, the network integrates the
Pansharpening Dynamic Feature Fusion (PAN-DFF) module, which incorporates an
attention mechanism to achieve an optimal balance between spatial detail
enhancement and spectral fidelity. Comprehensive evaluations on publicly
available datasets confirm that RAPNet delivers superior performance compared
to existing approaches, as demonstrated by both quantitative metrics and
qualitative assessments. Ablation analyses further substantiate the
effectiveness of the proposed adaptive components.
comment: Accepted by the 6th International Conference on Artificial
Intelligence and Electromechanical Automation (AIEA 2025). 5 pages, 6 figures
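The content-adaptive convolution idea can be sketched as follows: a small branch predicts a per-pixel KxK kernel that reweights the unfolded input. This is a generic illustration of spatially adaptive convolution under assumed shapes, not the authors' RAPConv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveConv(nn.Module):
    """Minimal content-adaptive convolution: a small branch predicts a KxK
    spatial kernel per pixel, which is applied to the unfolded input."""

    def __init__(self, channels, k=3):
        super().__init__()
        self.k = k
        self.kernel_pred = nn.Conv2d(channels, k * k, kernel_size=3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        kernels = torch.softmax(self.kernel_pred(x), dim=1)       # (B, K*K, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)        # (B, C*K*K, H*W)
        patches = patches.view(b, c, self.k * self.k, h, w)
        return (patches * kernels.unsqueeze(1)).sum(dim=2)        # (B, C, H, W)

y = AdaptiveConv(8)(torch.randn(1, 8, 16, 16))
print(y.shape)  # torch.Size([1, 8, 16, 16])
```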
♻ ☆ ResFlow: Fine-tuning Residual Optical Flow for Event-based High Temporal Resolution Motion Estimation
Event cameras hold significant promise for high-temporal-resolution (HTR)
motion estimation. However, estimating event-based HTR optical flow faces two
key challenges: the absence of HTR ground-truth data and the intrinsic sparsity
of event data. Most existing approaches rely on the flow accumulation paradigms
to indirectly supervise intermediate flows, often resulting in accumulation
errors and optimization difficulties. To address these challenges, we propose a
residual-based paradigm for estimating HTR optical flow with event data. Our
approach separates HTR flow estimation into two stages: global linear motion
estimation and HTR residual flow refinement. The residual paradigm effectively
mitigates the impact of event sparsity on optimization and is compatible with
any low-temporal-resolution (LTR) flow algorithm. Next, to address the challenge posed by the absence of HTR
ground truth, we incorporate novel learning strategies. Specifically, we
initially employ a shared refiner to estimate the residual flows, enabling both
LTR supervision and HTR inference. Subsequently, we introduce regional noise to
simulate the residual patterns of intermediate flows, facilitating the
adaptation from LTR supervision to HTR inference. Additionally, we show that
the noise-based strategy supports in-domain self-supervised training.
Comprehensive experimental results demonstrate that our approach achieves
state-of-the-art accuracy in both LTR and HTR metrics, highlighting its
effectiveness and superiority.
comment: 12 pages, 9 figures
♻ ☆ Spatially-guided Temporal Aggregation for Robust Event-RGB Optical Flow Estimation
Current optical flow methods exploit the stable appearance of frame (or RGB)
data to establish robust correspondences across time. Event cameras, on the
other hand, provide high-temporal-resolution motion cues and excel in
challenging scenarios. These complementary characteristics underscore the
potential of integrating frame and event data for optical flow estimation.
However, most cross-modal approaches fail to fully utilize the complementary
advantages, relying instead on simply stacking information. This study
introduces a novel approach that uses a spatially dense modality to guide the
aggregation of the temporally dense event modality, achieving effective
cross-modal fusion. Specifically, we propose an event-enhanced frame
representation that preserves the rich texture of frames and the basic
structure of events. We use the enhanced representation as the guiding modality
and employ events to capture temporally dense motion information. The robust
motion features derived from the guiding modality direct the aggregation of
motion information from events. To further enhance fusion, we propose a
transformer-based module that complements sparse event motion features with
spatially rich frame information and enhances global information propagation.
Additionally, a mix-fusion encoder is designed to extract comprehensive
spatiotemporal contextual features from both modalities. Extensive experiments
on the MVSEC and DSEC-Flow datasets demonstrate the effectiveness of our
framework. Leveraging the complementary strengths of frames and events, our
method achieves leading performance on the DSEC-Flow dataset. Compared to the
event-only model, frame guidance improves accuracy by 10\%. Furthermore, it
outperforms the state-of-the-art fusion-based method with a 4\% accuracy gain
and a 45\% reduction in inference time.
comment: 11 pages, 8 figures, under review
♻ ☆ Regional quality estimation for echocardiography using deep learning
Gilles Van De Vyver, Svein-Erik Måsøy, Håvard Dalen, Bjørnar Leangen Grenne, Espen Holte, Sindre Hellum Olaisen, John Nyberg, Andreas Østvik, Lasse Løvstakken, Erik Smistad
Automatic estimation of cardiac ultrasound image quality can be beneficial
for guiding operators and ensuring the accuracy of clinical measurements.
Previous work often fails to distinguish the view correctness of the
echocardiogram from the image quality. Additionally, previous studies only
provide a global image quality value, which limits their practical utility. In
this work, we developed and compared three methods to estimate image quality:
1) classic pixel-based metrics like the generalized contrast-to-noise ratio
(gCNR), computed on myocardial segments as the region of interest and the left
ventricle lumen as background, obtained using a U-Net segmentation; 2) local
image coherence derived from a U-Net model that predicts coherence from B-mode
images; and 3) a deep
convolutional network that predicts the quality of each region directly in an
end-to-end fashion. We evaluate each method against manual regional image
quality annotations by three experienced cardiologists. The results indicate
poor performance of the gCNR metric, with Spearman correlation to the
annotations of rho = 0.24. The end-to-end learning model obtains the best
result, rho = 0.69, comparable to the inter-observer correlation, rho = 0.63.
Finally, the coherence-based method, with rho = 0.58, outperformed the
classical metrics and is more generic than the end-to-end approach. The image
quality prediction tool is available as an open source Python library at
https://github.com/GillesVanDeVyver/arqee.
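For reference, the gCNR used as the first method above is commonly computed as one minus the overlap of the normalized intensity histograms of the region of interest and the background. A minimal sketch, with the bin count and value range as assumptions:

```python
import numpy as np

def gcnr(roi_pixels, background_pixels, bins=256, value_range=(0.0, 1.0)):
    """Generalized contrast-to-noise ratio: 1 minus the overlap of the
    normalized intensity histograms of the ROI and the background."""
    p, _ = np.histogram(roi_pixels, bins=bins, range=value_range, density=True)
    q, _ = np.histogram(background_pixels, bins=bins, range=value_range, density=True)
    bin_width = (value_range[1] - value_range[0]) / bins
    overlap = np.sum(np.minimum(p, q)) * bin_width
    return 1.0 - overlap

# toy example: well-separated intensities give a gCNR close to 1
rng = np.random.default_rng(0)
roi = np.clip(rng.normal(0.7, 0.05, 5000), 0, 1)   # e.g. myocardial segment
bg = np.clip(rng.normal(0.2, 0.05, 5000), 0, 1)    # e.g. left-ventricle lumen
print(round(gcnr(roi, bg), 3))
```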
♻ ☆ A Versatile Pathology Co-pilot via Reasoning Enhanced Multimodal Large Language Model
Zhe Xu, Ziyi Liu, Junlin Hou, Jiabo Ma, Cheng Jin, Yihui Wang, Zhixuan Chen, Zhengyu Zhang, Fuxiang Huang, Zhengrui Guo, Fengtao Zhou, Yingxue Xu, Xi Wang, Ronald Cheong Kin Chan, Li Liang, Hao Chen
Multimodal large language models (MLLMs) have emerged as powerful tools for
computational pathology, offering unprecedented opportunities to integrate
pathological images with language context for comprehensive diagnostic
analysis. These models hold particular promise for automating complex tasks
that traditionally require expert interpretation by pathologists. However,
current MLLM approaches in pathology demonstrate significantly constrained
reasoning capabilities, primarily due to their reliance on expensive
chain-of-thought annotations. Additionally, existing methods remain limited to
the simple application of visual question answering (VQA) at the
region-of-interest (ROI) level, failing to address the full spectrum of
diagnostic needs such as ROI classification, detection, segmentation,
whole-slide-image (WSI) classification and VQA in clinical practice. In this
study, we present SmartPath-R1, a versatile MLLM capable of simultaneously
addressing both ROI-level and WSI-level tasks while demonstrating robust
pathological reasoning capability. Our framework combines scale-dependent
supervised fine-tuning and task-aware reinforcement fine-tuning, which
circumvents the requirement for chain-of-thought supervision by leveraging the
intrinsic knowledge within MLLM. Furthermore, SmartPath-R1 integrates
multiscale and multitask analysis through a mixture-of-experts mechanism,
enabling dynamic processing for diverse tasks. We curate a large-scale dataset
comprising 2.3M ROI samples and 188K WSI samples for training and evaluation.
Extensive experiments across 72 tasks validate the effectiveness and
superiority of the proposed approach. This work represents a significant step
toward developing versatile, reasoning-enhanced AI systems for precision
pathology.
♻ ☆ Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
Referring Expression Segmentation (RES) aims to segment image regions
specified by referring expressions and has become popular with the rise of
multimodal large models (MLLMs). While MLLMs excel in semantic understanding,
their token-generation paradigm struggles with pixel-level dense prediction.
Existing RES methods either couple MLLMs with the parameter-heavy Segment
Anything Model (SAM) with 632M network parameters or adopt SAM-free lightweight
pipelines that sacrifice accuracy. To address the trade-off between performance
and cost, we specifically propose MLLMSeg, a novel framework that fully
exploits the inherent visual detail features encoded in the MLLM vision encoder
without introducing an extra visual encoder. Besides, we propose a
detail-enhanced and semantic-consistent feature fusion module (DSFF) that fully
integrates the detail-related visual feature with the semantic-related feature
output by the large language model (LLM) of MLLM. Finally, we establish a
light-weight mask decoder with only 34M network parameters that optimally
leverages detailed spatial features from the visual encoder and semantic
features from the LLM to achieve precise mask prediction. Extensive experiments
demonstrate that our method generally surpasses both SAM-based and SAM-free
competitors, striking a better balance between performance and cost. Code is
available at https://github.com/jcwang0602/MLLMSeg.
comment: 9 pages, 4 figures
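A hedged sketch of the decoding idea above: fuse a detail feature map from the vision encoder with a semantic vector from the LLM, then upsample to a mask. The channel sizes, gating rule, and module name are assumptions and do not reproduce the paper's DSFF module or its 34M-parameter decoder.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightMaskDecoder(nn.Module):
    """Illustrative light-weight decoder: semantic-gated detail features
    followed by a small convolutional head and bilinear upsampling."""

    def __init__(self, vis_dim=1024, llm_dim=4096, hidden=256):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hidden, 1)
        self.sem_proj = nn.Linear(llm_dim, hidden)
        self.head = nn.Sequential(
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.GELU(),
            nn.Conv2d(hidden, 1, 1),
        )

    def forward(self, vis_feat, sem_token, out_size=(336, 336)):
        # vis_feat: (B, vis_dim, H, W); sem_token: (B, llm_dim)
        v = self.vis_proj(vis_feat)
        s = self.sem_proj(sem_token)[:, :, None, None]
        logits = self.head(v * torch.sigmoid(s))   # gate detail features by semantics
        return F.interpolate(logits, size=out_size, mode="bilinear", align_corners=False)

m = LightMaskDecoder()
mask = m(torch.randn(1, 1024, 24, 24), torch.randn(1, 4096))
print(mask.shape)  # torch.Size([1, 1, 336, 336])
```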
♻ ☆ Identify, Isolate, and Purge: Mitigating Hallucinations in LVLMs via Self-Evolving Distillation
Large Vision-Language Models (LVLMs) have demonstrated remarkable
advancements in numerous areas such as multimedia. However, hallucination
issues significantly limit their credibility and application potential.
Existing mitigation methods typically rely on external tools or the comparison
of multi-round inference, which significantly increase inference time. In this
paper, we propose \textbf{SE}lf-\textbf{E}volving \textbf{D}istillation
(\textbf{SEED}), which identifies hallucinations within the inner knowledge of
LVLMs, isolates and purges them, and then distills the purified knowledge back
into the model, enabling self-evolution. Furthermore, we identified that
traditional distillation methods are prone to inducing void spaces in the
output space of LVLMs. To address this issue, we propose a Mode-Seeking
Evolving approach, which performs distillation to capture the dominant modes of
the purified knowledge distribution, thereby avoiding the chaotic results that
could emerge from void spaces. Moreover, we introduce a Hallucination
Elimination Adapter, which corrects the dark knowledge of the original model by
learning purified knowledge. Extensive experiments on multiple benchmarks
validate the superiority of our SEED, demonstrating substantial improvements in
mitigating hallucinations for representative LVLM models such as LLaVA-1.5 and
InternVL2. Remarkably, the F1 score of LLaVA-1.5 on the hallucination
evaluation metric POPE-Random improved from 81.3 to 88.3.
comment: In Figure 2, the correlation coefficient and the scatter plot do not
match. I calculated this correlation using two sets of settings. I used the
scatter plot from setting A, but accidentally wrote the correlation
coefficient, r, from setting B
♻ ☆ ContrastAlign: Toward Robust BEV Feature Alignment via Contrastive Learning for Multi-Modal 3D Object Detection
Ziying Song, Hongyu Pan, Feiyang Jia, Yongchang Zhang, Lin Liu, Lei Yang, Shaoqing Xu, Peiliang Wu, Caiyan Jia, Zheng Zhang, Yadan Luo
In the field of 3D object detection tasks, fusing heterogeneous features from
LiDAR and camera sensors into a unified Bird's Eye View (BEV) representation is
a widely adopted paradigm. However, existing methods often suffer from
imprecise sensor calibration, leading to feature misalignment in LiDAR-camera
BEV fusion. Moreover, such inaccuracies cause errors in depth estimation for
the camera branch, aggravating misalignment between LiDAR and camera BEV
features. In this work, we propose a novel ContrastAlign approach that utilizes
contrastive learning to enhance the alignment of heterogeneous modalities,
thereby improving the robustness of the fusion process. Specifically, our
approach comprises three key components: (1) the L-Instance module, which
extracts LiDAR instance features within the LiDAR BEV features; (2) the
C-Instance module, which predicts camera instance features through Region of
Interest (RoI) pooling on the camera BEV features; (3) the InstanceFusion
module, which employs contrastive learning to generate consistent instance
features across heterogeneous modalities. Subsequently, we use graph matching
to calculate the similarity between neighboring camera instance features and
similar instance features to complete the alignment of instance
features. Our method achieves SOTA performance, with an mAP of 71.5%,
surpassing GraphBEV by 1.4% on the nuScenes val set. Importantly, our method
outperforms BEVFusion under conditions with spatial and temporal misalignment
noise, improving mAP by 1.4% and 11.1% on the nuScenes dataset. Notably, on the
Argoverse2 dataset, ContrastAlign outperforms GraphBEV by 1.0% in mAP,
indicating that the greater the distance, the more severe the feature
misalignment and the more effective our method becomes.
comment: 12 pages, 3 figures
♻ ☆ Automatic Image Colorization with Convolutional Neural Networks and Generative Adversarial Networks
Image colorization, the task of adding colors to grayscale images, has been
the focus of significant research efforts in computer vision in recent years
for its various application areas such as color restoration and automatic
animation colorization [15, 1]. The colorization problem is challenging as it
is highly ill-posed with two out of three image dimensions lost, resulting in
large degrees of freedom. However, semantics of the scene as well as the
surface texture could provide important cues for colors: the sky is typically
blue, the clouds are typically white and the grass is typically green, and
there are huge amounts of training data available for learning such priors
since any colored image could serve as a training data point [20].
Colorization was initially formulated as a regression task [5], which ignores
the multi-modal nature of color prediction. In this project, we explore
automatic image colorization via classification and adversarial learning. We
will build our models on prior works, apply modifications for our specific
scenario and make comparisons.
comment: All authors have equal authorship and equal contribution, ranked in
alphabetic order. First version of this paper was completed and published in
2021
♻ ☆ Rethinking Transformer-Based Blind-Spot Network for Self-Supervised Image Denoising AAAI 2025
Blind-spot networks (BSN) have been prevalent neural architectures in
self-supervised image denoising (SSID). However, most existing BSNs are
conducted with convolution layers. Although transformers have shown the
potential to overcome the limitations of convolutions in many image restoration
tasks, the attention mechanisms may violate the blind-spot requirement, thereby
restricting their applicability in BSN. To this end, we propose to analyze and
redesign the channel and spatial attentions to meet the blind-spot requirement.
Specifically, channel self-attention may leak the blind-spot information in
multi-scale architectures, since the downsampling shuffles the spatial feature
into channel dimensions. To alleviate this problem, we divide the channel into
several groups and perform channel attention separately. For spatial
self-attention, we apply an elaborate mask to the attention matrix to restrict
and mimic the receptive field of dilated convolution. Based on the redesigned
channel and window attentions, we build a Transformer-based Blind-Spot Network
(TBSN), which shows strong local fitting and global perspective abilities.
Furthermore, we introduce a knowledge distillation strategy that distills TBSN
into smaller denoisers to improve computational efficiency while maintaining
performance. Extensive experiments on real-world image denoising datasets show
that TBSN largely extends the receptive field and exhibits favorable
performance against state-of-the-art SSID methods.
comment: AAAI 2025 Camera Ready, update Fig.4
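The grouped channel-attention idea described above can be sketched as channel self-attention applied independently within channel groups, so that spatial information shuffled into channels by downsampling is not mixed across the whole channel dimension. The group count and exact attention form below are illustrative assumptions, not TBSN's implementation.

```python
import torch
import torch.nn as nn

class GroupedChannelAttention(nn.Module):
    """Channel self-attention restricted to channel groups, limiting
    cross-channel mixing in multi-scale blind-spot architectures."""

    def __init__(self, channels, groups=4):
        super().__init__()
        assert channels % groups == 0
        self.groups = groups
        self.to_qkv = nn.Conv2d(channels, channels * 3, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        g, cg = self.groups, c // self.groups
        q, k, v = self.to_qkv(x).chunk(3, dim=1)
        # reshape to (B*G, Cg, H*W) so attention only mixes channels inside a group
        q, k, v = (t.reshape(b * g, cg, h * w) for t in (q, k, v))
        attn = torch.softmax(q @ k.transpose(1, 2) / (h * w) ** 0.5, dim=-1)  # (B*G, Cg, Cg)
        out = attn @ v                                                        # (B*G, Cg, H*W)
        return out.reshape(b, c, h, w) + x

y = GroupedChannelAttention(16)(torch.randn(2, 16, 8, 8))
print(y.shape)  # torch.Size([2, 16, 8, 8])
```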
♻ ☆ WIPES: Wavelet-based Visual Primitives
Pursuing a continuous visual representation that offers flexible frequency
modulation and fast rendering speed has recently garnered increasing attention
in the fields of 3D vision and graphics. However, existing representations
often rely on frequency guidance or complex neural network decoding, leading to
spectrum loss or slow rendering. To address these limitations, we propose
WIPES, a universal Wavelet-based vIsual PrimitivES for representing
multi-dimensional visual signals. Building on the spatial-frequency
localization advantages of wavelets, WIPES effectively captures both the
low-frequency "forest" and the high-frequency "trees." Additionally, we develop
a wavelet-based differentiable rasterizer to achieve fast visual rendering.
Experimental results on various visual tasks, including 2D image
representation, 5D static and 6D dynamic novel view synthesis, demonstrate that
WIPES, as a visual primitive, offers higher rendering quality and faster
inference than INR-based methods, and outperforms Gaussian-based
representations in rendering quality.
comment: IEEE/CVF International Conference on Computer Vision 2025
♻ ☆ Stereo-based 3D Anomaly Object Detection for Autonomous Driving: A New Dataset and Baseline
3D detection technology is widely used in the field of autonomous driving,
with its application scenarios gradually expanding from enclosed highways to
open conventional roads. For rare anomaly categories that appear on the road,
3D detection models trained on closed sets often misdetect or fail to detect
anomaly objects. To address this risk, it is necessary to enhance the
generalization ability of 3D detection models for targets of arbitrary shapes
and to possess the capability to filter out anomalies. The generalization of 3D
detection is limited by two factors: the coupled training of 2D and 3D, and the
insufficient diversity in the scale distribution of training samples. This
paper proposes a Stereo-based 3D Anomaly object Detection (S3AD) algorithm,
which decouples the training strategy of 3D and 2D to release the
generalization ability for arbitrary 3D foreground detection, and proposes an
anomaly scoring algorithm based on foreground confidence prediction, achieving
target-level anomaly scoring. In order to further verify and enhance the
generalization of anomaly detection, we use a 3D rendering method to synthesize
two augmented-reality binocular stereo 3D detection datasets, collectively named
KITTI-AR. KITTI-AR extends KITTI by adding 97 new categories, totaling 6k
pairs of stereo images. The KITTI-AR-ExD subset includes 39 common categories
as extra training data to address the sparse sample distribution issue.
Additionally, 58 rare categories form the KITTI-AR-OoD subset, which are not
used in training to simulate zero-shot scenarios in real-world settings, solely
for evaluating 3D anomaly detection. Finally, the performance of the algorithm
and the dataset is verified in the experiments. (Code and dataset can be
obtained at https://github.com/shiyi-mu/S3AD-Code).
comment: under review
♻ ☆ Rapid Urban Visibility Hotspots: Quantifying Building Vertex Visibility from Connected Vehicle Trajectories using Spatial Indexing
Effective placement of Out-of-Home advertising and street furniture requires
accurate identification of locations offering maximum visual exposure to target
audiences, particularly vehicular traffic. Traditional site selection methods
often rely on static traffic counts or subjective assessments. This research
introduces a data-driven methodology to objectively quantify location
visibility by analyzing large-scale connected vehicle trajectory data (sourced
from Compass IoT) within urban environments. We model the dynamic driver
field-of-view using a forward-projected visibility area for each vehicle
position derived from interpolated trajectories. By integrating this with
building vertex locations extracted from OpenStreetMap, we quantify the
cumulative visual exposure, or "visibility count", for thousands of potential
points of interest near roadways. The core
technical contribution involves the construction of a BallTree spatial index
over building vertices. This enables highly efficient (O(log N) complexity)
radius queries to determine which vertices fall within the viewing circles of
millions of trajectory points across numerous trips, significantly
outperforming brute-force geometric checks. Analysis reveals two key findings:
1) Visibility is highly concentrated, identifying distinct 'visual hotspots'
receiving disproportionately high exposure compared to average locations. 2)
The aggregated visibility counts across vertices conform to a Log-Normal
distribution.
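A minimal sketch of the BallTree-based counting described above, using scikit-learn; the 50 m radius, planar coordinates, and plain circular field-of-view are simplifying assumptions (the paper models a forward-projected visibility area, and haversine distance would be used for raw lat/lon).

```python
import numpy as np
from sklearn.neighbors import BallTree

# Hypothetical planar (projected) coordinates in metres; with raw lat/lon one
# would instead build BallTree(np.radians(coords), metric="haversine").
rng = np.random.default_rng(0)
building_vertices = rng.uniform(0, 1000, size=(50_000, 2))      # e.g. from OpenStreetMap
trajectory_points = rng.uniform(0, 1000, size=(100_000, 2))     # interpolated vehicle positions

tree = BallTree(building_vertices)          # supports O(log N) radius queries per point

# indices of vertices inside each trajectory point's viewing circle
hits = tree.query_radius(trajectory_points[:10_000], r=50.0)

# accumulate a visibility count per building vertex
visibility = np.zeros(len(building_vertices), dtype=np.int64)
for idx in hits:
    visibility[idx] += 1
print(visibility.max(), visibility.argmax())   # most-exposed vertex
```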
♻ ☆ Mask and Restore: Blind Backdoor Defense at Test Time with Masked Autoencoder
Deep neural networks are vulnerable to backdoor attacks, where an adversary
manipulates the model behavior through overlaying images with special triggers.
Existing backdoor defense methods often require accessing a few validation data
and model parameters, which is impractical in many real-world applications,
e.g., when the model is provided as a cloud service. In this paper, we address
the practical task of blind backdoor defense at test time, in particular for
local attacks and black-box models. The true label of every test image needs to
be recovered on the fly from a suspicious model regardless of image benignity.
We consider test-time image purification that incapacitates local triggers
while keeping semantic contents intact. Due to diverse trigger patterns and
sizes, heuristic trigger search can be unscalable. We circumvent this
barrier by leveraging the strong reconstruction power of generative models and
propose Blind Defense with Masked AutoEncoder (BDMAE). BDMAE detects possible
local triggers using image structural similarity and label consistency between
the test image and MAE restorations. The detection results are then refined by
considering trigger topology. Finally, we fuse MAE restorations adaptively into
a purified image for making prediction. Extensive experiments under different
backdoor settings validate its effectiveness and generalizability.
♻ ☆ SRMA-Mamba: Spatial Reverse Mamba Attention Network for Pathological Liver Segmentation in MRI Volumes
Jun Zeng, Yannan Huang, Elif Keles, Halil Ertugrul Aktas, Gorkem Durak, Nikhil Kumar Tomar, Quoc-Huy Trinh, Deepak Ranjan Nayak, Ulas Bagci, Debesh Jha
Liver Cirrhosis plays a critical role in the prognosis of chronic liver
disease. Early detection and timely intervention are critical in significantly
reducing mortality rates. However, the intricate anatomical architecture and
diverse pathological changes of liver tissue complicate the accurate detection
and characterization of lesions in clinical settings. Existing methods
underutilize the spatial anatomical details in volumetric MRI data, thereby
hindering their clinical effectiveness and explainability. To address this
challenge, we introduce a novel Mamba-based network, SRMA-Mamba, designed to
model the spatial relationships within the complex anatomical structures of MRI
volumes. By integrating the Spatial Anatomy-Based Mamba module (SABMamba),
SRMA-Mamba performs selective Mamba scans within liver cirrhotic tissues and
combines anatomical information from the sagittal, coronal, and axial planes to
construct a global spatial context representation, enabling efficient
volumetric segmentation of pathological liver structures. Furthermore, we
introduce the Spatial Reverse Attention module (SRMA), designed to
progressively refine cirrhotic details in the segmentation map, utilizing both
the coarse segmentation map and hierarchical encoding features. Extensive
experiments demonstrate that SRMA-Mamba surpasses state-of-the-art methods,
delivering exceptional performance in 3D pathological liver segmentation. Our
code is publicly available at https://github.com/JunZengz/SRMA-Mamba.
comment: 9 pages, 4 figures
♻ ☆ MCN-SLAM: Multi-Agent Collaborative Neural SLAM with Hybrid Implicit Neural Scene Representation
Tianchen Deng, Guole Shen, Xun Chen, Shenghai Yuan, Hongming Shen, Guohao Peng, Zhenyu Wu, Jingchuan Wang, Lihua Xie, Danwei Wang, Hesheng Wang, Weidong Chen
Neural implicit scene representations have recently shown promising results
in dense visual SLAM. However, existing implicit SLAM algorithms are
constrained to single-agent scenarios and struggle in large-scale
scenes and long sequences. Existing NeRF-based multi-agent SLAM frameworks
cannot meet the constraints of communication bandwidth. To this end, we propose
the first distributed multi-agent collaborative neural SLAM framework with
hybrid scene representation, distributed camera tracking, intra-to-inter loop
closure, and online distillation for multiple submap fusion. A novel
triplane-grid joint scene representation method is proposed to improve scene
reconstruction. A novel intra-to-inter loop closure method is designed to
achieve local (single-agent) and global (multi-agent) consistency. We also
design a novel online distillation method to fuse the information of different
submaps to achieve global consistency. Furthermore, to the best of our
knowledge, there is no real-world dataset for NeRF-based/GS-based SLAM that
provides both continuous-time trajectory ground truth and high-accuracy 3D
mesh ground truth. To this end, we propose the first real-world Dense slam
(DES) dataset covering both single-agent and multi-agent scenarios, ranging
from small rooms to large-scale outdoor scenes, with high-accuracy ground truth
for both 3D mesh and continuous-time camera trajectory. This dataset can
advance research in SLAM, 3D reconstruction, and visual foundation models.
Experiments on various datasets demonstrate the superiority of the proposed
method in mapping, tracking, and communication. The dataset and code will be
open-sourced at
https://github.com/dtc111111/mcnslam.
♻ ☆ ReservoirTTA: Prolonged Test-time Adaptation for Evolving and Recurring Domains
Guillaume Vray, Devavrat Tomar, Xufeng Gao, Jean-Philippe Thiran, Evan Shelhamer, Behzad Bozorgtabar
This paper introduces ReservoirTTA, a novel plug-in framework designed for
prolonged test-time adaptation (TTA) in scenarios where the test domain
continuously shifts over time, including cases where domains recur or evolve
gradually. At its core, ReservoirTTA maintains a reservoir of
domain-specialized models -- an adaptive test-time model ensemble -- that both
detects new domains via online clustering over style features of incoming
samples and routes each sample to the appropriate specialized model, and
thereby enables domain-specific adaptation. This multi-model strategy overcomes
key limitations of single model adaptation, such as catastrophic forgetting,
inter-domain interference, and error accumulation, ensuring robust and stable
performance on sustained non-stationary test distributions. Our theoretical
analysis reveals key components that bound parameter variance and prevent model
collapse, while our plug-in TTA module mitigates catastrophic forgetting of
previously encountered domains. Extensive experiments on the classification
corruption benchmarks, including ImageNet-C and CIFAR-10/100-C, as well as the
Cityscapes$\rightarrow$ACDC semantic segmentation task, covering recurring and
continuously evolving domain shifts, demonstrate that ReservoirTTA
significantly improves adaptation accuracy and maintains stable performance
across prolonged, recurring shifts, outperforming state-of-the-art methods. Our
code is publicly available at https://github.com/LTS5/ReservoirTTA.
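A hedged sketch of the routing idea above: compute a style descriptor from feature statistics, assign each batch to the nearest domain centroid, and open a new domain slot when nothing is close enough. The threshold, momentum, and choice of style statistic are illustrative assumptions, not the paper's clustering procedure.

```python
import torch

class StyleRouter:
    """Minimal online router over style features (channel-wise mean/std of an
    early feature map): nearest-centroid assignment with new-domain spawning."""

    def __init__(self, threshold=1.0, momentum=0.9, max_domains=16):
        self.centroids, self.threshold = [], threshold
        self.momentum, self.max_domains = momentum, max_domains

    @staticmethod
    def style_feature(feat):                      # feat: (B, C, H, W)
        mu = feat.mean(dim=(0, 2, 3))
        sigma = feat.std(dim=(0, 2, 3))
        return torch.cat([mu, sigma])             # (2C,) style descriptor

    def route(self, feat):
        s = self.style_feature(feat)
        if not self.centroids:
            self.centroids.append(s.clone())
            return 0
        dists = torch.stack([torch.norm(s - c) for c in self.centroids])
        idx = int(torch.argmin(dists))
        if dists[idx] > self.threshold and len(self.centroids) < self.max_domains:
            self.centroids.append(s.clone())      # new domain detected
            return len(self.centroids) - 1
        # exponential moving-average update of the matched centroid
        self.centroids[idx] = self.momentum * self.centroids[idx] + (1 - self.momentum) * s
        return idx

router = StyleRouter()
domain_id = router.route(torch.randn(8, 64, 32, 32))
print(domain_id)   # index of the specialized model to adapt/use
```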
♻ ☆ WHALES: A Multi-Agent Scheduling Dataset for Enhanced Cooperation in Autonomous Driving
Cooperative perception research is hindered by the limited availability of
datasets that capture the complexity of real-world Vehicle-to-Everything (V2X)
interactions, particularly under dynamic communication constraints. To address
this gap, we introduce WHALES (Wireless enhanced Autonomous vehicles with Large
number of Engaged agents), the first large-scale V2X dataset explicitly
designed to benchmark communication-aware agent scheduling and scalable
cooperative perception. WHALES introduces a new benchmark that enables
state-of-the-art (SOTA) research in communication-aware cooperative perception,
featuring an average of 8.4 cooperative agents per scene and 2.01 million
annotated 3D objects across diverse traffic scenarios. It incorporates detailed
communication metadata to emulate real-world communication bottlenecks,
enabling rigorous evaluation of scheduling strategies. To further advance the
field, we propose the Coverage-Aware Historical Scheduler (CAHS), a novel
scheduling baseline that selects agents based on historical viewpoint coverage,
improving perception performance over existing SOTA methods. WHALES bridges the
gap between simulated and real-world V2X challenges, providing a robust
framework for exploring perception-scheduling co-design, cross-data
generalization, and scalability limits. The WHALES dataset and code are
available at https://github.com/chensiweiTHU/WHALES.
♻ ☆ VoiceCloak: A Multi-Dimensional Defense Framework against Unauthorized Diffusion-based Voice Cloning
Diffusion Models (DMs) have achieved remarkable success in realistic voice
cloning (VC), while they also increase the risk of malicious misuse. Existing
proactive defenses designed for traditional VC models aim to disrupt the
forgery process, but they have been proven incompatible with DMs due to the
intricate generative mechanisms of diffusion. To bridge this gap, we introduce
VoiceCloak, a multi-dimensional proactive defense framework with the goal of
obfuscating speaker identity and degrading perceptual quality in potential
unauthorized VC. To achieve these goals, we conduct a focused analysis to
identify specific vulnerabilities within DMs, allowing VoiceCloak to disrupt
the cloning process by introducing adversarial perturbations into the reference
audio. Specifically, to obfuscate speaker identity, VoiceCloak first targets
speaker identity by distorting representation learning embeddings to maximize
identity variation, which is guided by auditory perception principles.
Additionally, VoiceCloak disrupts crucial conditional guidance processes,
particularly attention context, thereby preventing the alignment of vocal
characteristics that are essential for achieving convincing cloning. Then, to
address the second objective, VoiceCloak introduces score magnitude
amplification to actively steer the reverse trajectory away from the generation
of high-quality speech. Noise-guided semantic corruption is further employed to
disrupt structural speech semantics captured by DMs, degrading output quality.
Extensive experiments highlight VoiceCloak's outstanding defense success rate
against unauthorized diffusion-based voice cloning. Audio samples of VoiceCloak
are available at https://voice-cloak.github.io/VoiceCloak/.
♻ ☆ MR-EEGWaveNet: Multiresolutional EEGWaveNet for Seizure Detection from Long EEG Recordings
Feature engineering for generalized seizure detection models remains a
significant challenge. Recently proposed models show variable performance
depending on the training data and remain ineffective at accurately
distinguishing artifacts from seizure data. In this study, we propose a novel
end-to-end model, "Multiresolutional EEGWaveNet (MR-EEGWaveNet)," which
efficiently distinguishes seizure events from background electroencephalogram
(EEG) and artifacts/noise by capturing both temporal dependencies across
different time frames and spatial relationships between channels. The model has
three modules: convolution, feature extraction, and predictor. The convolution
module extracts features through depth-wise and spatio-temporal convolution.
The feature extraction module individually reduces the feature dimension
extracted from EEG segments and their sub-segments. Subsequently, the extracted
features are concatenated into a single vector for classification using a fully
connected classifier called the predictor module. In addition, an anomaly
score-based post-classification processing technique is introduced to reduce
the false-positive rates of the model. Experimental results are reported and
analyzed using different parameter settings and datasets (Siena (public) and
Juntendo (private)). The proposed MR-EEGWaveNet significantly outperformed the
conventional non-multiresolution approach, improving the F1 scores from 0.177
to 0.336 on Siena and 0.327 to 0.488 on Juntendo, with precision gains of 15.9%
and 20.62%, respectively.
comment: 33 pages, 10 figures, 18 tables
♻ ☆ FreqDGT: Frequency-Adaptive Dynamic Graph Networks with Transformer for Cross-subject EEG Emotion Recognition
Electroencephalography (EEG) serves as a reliable and objective signal for
emotion recognition in affective brain-computer interfaces, offering unique
advantages through its high temporal resolution and ability to capture
authentic emotional states that cannot be consciously controlled. However,
cross-subject generalization remains a fundamental challenge due to individual
variability, cognitive traits, and emotional responses. We propose FreqDGT, a
frequency-adaptive dynamic graph transformer that systematically addresses
these limitations through an integrated framework. FreqDGT introduces
frequency-adaptive processing (FAP) to dynamically weight emotion-relevant
frequency bands based on neuroscientific evidence, employs adaptive dynamic
graph learning (ADGL) to learn input-specific brain connectivity patterns, and
implements a multi-scale temporal disentanglement network (MTDN) that combines
hierarchical temporal transformers with adversarial feature disentanglement to
capture temporal dynamics and ensure cross-subject robustness.
Comprehensive experiments demonstrate that FreqDGT significantly improves
cross-subject emotion recognition accuracy, confirming the effectiveness of
integrating frequency-adaptive, spatial-dynamic, and temporal-hierarchical
modeling while ensuring robustness to individual differences. The code is
available at https://github.com/NZWANG/FreqDGT.
♻ ☆ Boosting Adversarial Transferability for Hyperspectral Image Classification Using 3D Structure-invariant Transformation and Weighted Intermediate Feature Divergence
Deep Neural Networks (DNNs) are vulnerable to adversarial attacks, which pose
security challenges to hyperspectral image (HSI) classification based on DNNs.
Numerous adversarial attack methods have been designed in the domain of natural
images. However, unlike natural images, HSIs contain rich, high-dimensional
spectral information, which presents new challenges for generating
adversarial examples. Based on the specific characteristics of HSIs, this paper
proposes a novel method to enhance the transferability of the adversarial
examples for HSI classification using 3D structure-invariant transformation and
weighted intermediate feature divergence. While keeping the HSIs structure
invariant, the proposed method divides the image into blocks in both spatial
and spectral dimensions. Then, various transformations are applied on each
block to increase input diversity and mitigate the overfitting to substitute
models. Moreover, a weighted intermediate feature divergence loss is also
designed by leveraging the differences between the intermediate features of
original and adversarial examples. It constrains the perturbation direction by
enlarging the feature maps of the original examples, and assigns different
weights to different feature channels to destroy the features that have a
greater impact on HSI classification. Extensive experiments demonstrate that
the adversarial examples generated by the proposed method achieve more
effective adversarial transferability on three public HSI datasets.
Furthermore, the method maintains robust attack performance even under defense
strategies.
♻ ☆ Beyond the Horizon: Decoupling Multi-View UAV Action Recognition via Partial Order Transfer
Action recognition in unmanned aerial vehicles (UAVs) poses unique challenges
due to significant view variations along the vertical spatial axis. Unlike
traditional ground-based settings, UAVs capture actions at a wide range of
altitudes, resulting in considerable appearance discrepancies. We introduce a
multi-view formulation tailored to varying UAV altitudes and empirically
observe a partial order among views, where recognition accuracy consistently
decreases as altitude increases. This observation motivates a novel approach
that explicitly models the hierarchical structure of UAV views to improve
recognition performance across altitudes. To this end, we propose the Partial
Order Guided Multi-View Network (POG-MVNet), designed to address drastic view
variations by effectively leveraging view-dependent information across
different altitude levels. The framework comprises three key components: a View
Partition (VP) module, which uses the head-to-body ratio to group views by
altitude; an Order-aware Feature Decoupling (OFD) module, which disentangles
action-relevant and view-specific features under partial order guidance; and an
Action Partial Order Guide (APOG), which uses the partial order to transfer
informative knowledge from easier views to more challenging ones. We conduct
experiments on Drone-Action, MOD20, and UAV, demonstrating that POG-MVNet
significantly outperforms competing methods. For example, POG-MVNet achieves a
4.7% improvement on Drone-Action and a 3.5% improvement on UAV compared to
state-of-the-art methods ASAT and FAR. Code will be released soon.
comment: 11 pages
♻ ☆ MedVisionLlama: Leveraging Pre-Trained Large Language Model Layers to Enhance Medical Image Segmentation ICCV
Large Language Models (LLMs), known for their versatility in textual data,
are increasingly being explored for their potential to enhance medical image
segmentation, a crucial task for accurate diagnostic imaging. This study
explores enhancing Vision Transformers (ViTs) for medical image segmentation by
integrating pre-trained LLM transformer blocks. Our approach, which
incorporates a frozen LLM transformer block into the encoder of a ViT-based
model, leads to substantial improvements in segmentation performance across
various medical imaging modalities. We propose a Hybrid Attention Mechanism
that combines global and local feature learning with a Multi-Scale Fusion Block
for aggregating features across different scales. The enhanced model shows
significant performance gains, including an average Dice score increase from
0.74 to 0.79 and improvements in accuracy, precision, and the Jaccard Index.
These results demonstrate the effectiveness of LLM-based transformers in
refining medical image segmentation, highlighting their potential to
significantly boost model accuracy and robustness. The source code and our
implementation are available at:
https://github.com/AS-Lab/Marthi-et-al-2025-MedVisionLlama-Pre-Trained-LLM-Layers-to-Enhance-Medical-Image-Segmentation
comment: Accepted to the CVAMD Workshop (Computer Vision for Automated Medical
Diagnosis) at the 2025 IEEE/CVF International Conference on Computer Vision
(ICCVW 2025)
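To make the frozen-LLM-block idea above concrete, the sketch below projects visual tokens into an LM hidden size, passes them through a frozen transformer block, and projects back with a residual connection. A generic encoder layer stands in for the actual pre-trained LLM layer, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class FrozenLMBlockAdapter(nn.Module):
    """Sketch of inserting a frozen language-model transformer block into a ViT
    encoder: up-project visual tokens, run the frozen block, down-project, and
    add a residual connection."""

    def __init__(self, vit_dim=768, lm_dim=2048):
        super().__init__()
        self.up = nn.Linear(vit_dim, lm_dim)
        self.lm_block = nn.TransformerEncoderLayer(
            d_model=lm_dim, nhead=8, batch_first=True)   # stand-in for a frozen LLM layer
        self.down = nn.Linear(lm_dim, vit_dim)
        for p in self.lm_block.parameters():             # keep the LM block frozen
            p.requires_grad_(False)

    def forward(self, tokens):                           # tokens: (B, N, vit_dim)
        return tokens + self.down(self.lm_block(self.up(tokens)))

adapter = FrozenLMBlockAdapter()
out = adapter(torch.randn(2, 197, 768))
print(out.shape)  # torch.Size([2, 197, 768])
```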
♻ ☆ Refinement Module based on Parse Graph for Human Pose Estimation
Parse graphs have been widely used in human pose estimation (HPE) to model
the hierarchical structure and context relations of the human body, which has
been shown to effectively improve robustness and accuracy. But most methods
rely on parse graphs built from predefined skeletons, causing two key issues:
poor integratability with other models, and complex designs with redundant
parameters for hierarchy and context relation modeling of the body. To address
these issues, we propose a novel Refinement Module based on Parse Graph (RMPG).
RMPG abandons skeleton connections and refines features by building implicit
hierarchical structures and context relations between sub-feature maps, with
strong integratability. Furthermore, our hierarchical network design
demonstrates that RMPG can model the body's hierarchical structure and context
relations with a simpler architecture and fewer parameters. RMPG operates in
two stages: the top-down decomposition recursively partitions the feature map
into a tree-structured hierarchy, where each node corresponds to a sub-feature
map; the bottom-up composition aggregates context information to progressively
refine the feature representation. Extensive experiments show that RMPG can be
flexibly embedded into various methods, including our hierarchical networks,
and consistently improves performance across multiple mainstream HPE
benchmarks. The code will be released.
♻ ☆ Upsample What Matters: Region-Adaptive Latent Sampling for Accelerated Diffusion Transformers
Diffusion transformers have emerged as an alternative to U-net-based
diffusion models for high-fidelity image and video generation, offering
superior scalability. However, their heavy computation remains a major obstacle
to real-world deployment. Existing acceleration methods primarily exploit the
temporal dimension such as reusing cached features across diffusion timesteps.
Here, we propose Region-Adaptive Latent Upsampling (RALU), a training-free
framework that accelerates inference along the spatial dimension. RALU performs
mixed-resolution sampling across three stages: 1) low-resolution denoising
latent diffusion to efficiently capture global semantic structure, 2)
region-adaptive upsampling of specific regions prone to artifacts at full
resolution, and 3) upsampling of all latents to full resolution for detail
refinement. To stabilize generations across resolution transitions, we leverage
noise-timestep rescheduling to adapt the noise level across varying
resolutions. Our method significantly reduces computation while preserving
image quality by achieving up to 7.0$\times$ speed-up on FLUX and 3.0$\times$
on Stable Diffusion 3 with minimal degradation. Furthermore, RALU is
complementary to existing temporal accelerations such as caching methods, thus
can be seamlessly integrated to further reduce inference latency without
compromising generation quality.
♻ ☆ Segment Anything in Pathology Images with Natural Language
Zhixuan Chen, Junlin Hou, Liqi Lin, Yihui Wang, Yequan Bie, Xi Wang, Yanning Zhou, Ronald Cheong Kin Chan, Hao Chen
Pathology image segmentation is crucial in computational pathology for
analyzing histological features relevant to cancer diagnosis and prognosis.
However, current methods face major challenges in clinical applications due to
limited annotated data and restricted category definitions. To address these
limitations, we propose PathSegmentor, the first text-prompted segmentation
foundation model designed specifically for pathology images. We also introduce
PathSeg, the largest and most comprehensive dataset for pathology segmentation,
built from 21 public sources and containing 275k image-mask-label triples
across 160 diverse categories. With PathSegmentor, users can perform semantic
segmentation using natural language prompts, eliminating the need for laborious
spatial inputs such as points or boxes. Extensive experiments demonstrate that
PathSegmentor outperforms specialized models with higher accuracy and broader
applicability, while maintaining a compact architecture. It significantly
surpasses existing spatial- and text-prompted models by 0.145 and 0.429 in
overall Dice scores, respectively, showing strong robustness in segmenting
complex structures and generalizing to external datasets. Moreover,
PathSegmentor's outputs enhance the interpretability of diagnostic models
through feature importance estimation and imaging biomarker discovery, offering
pathologists evidence-based support for clinical decision-making. This work
advances the development of explainable AI in precision oncology.
♻ ☆ Image Augmentation Agent for Weakly Supervised Semantic Segmentation
Weakly-supervised semantic segmentation (WSSS) has achieved remarkable
progress using only image-level labels. However, most existing WSSS methods
focus on designing new network structures and loss functions to generate more
accurate dense labels, overlooking the limitations imposed by fixed datasets,
which can constrain performance improvements. We argue that more diverse
training images provide WSSS with richer information and help the model
understand more comprehensive semantic patterns. Therefore, in this paper, we
introduce a
novel approach called Image Augmentation Agent (IAA) which shows that it is
possible to enhance WSSS from the data generation perspective. IAA mainly designs an
augmentation agent that leverages large language models (LLMs) and diffusion
models to automatically generate additional images for WSSS. In practice, to
address the instability in prompt generation by LLMs, we develop a prompt
self-refinement mechanism. It allows LLMs to re-evaluate the rationality of
generated prompts and produce more coherent prompts. Additionally, we insert an
online filter into the diffusion generation process to dynamically ensure the
quality and balance of generated images. Experimental results show that our
method significantly surpasses state-of-the-art WSSS approaches on the PASCAL
VOC 2012 and MS COCO 2014 datasets.
comment: Accepted at Neurocomputing 2025
♻ ☆ Advancing Toward Robust and Scalable Fingerprint Orientation Estimation: From Gradients to Deep Learning
Fingerprint orientation estimation plays a crucial role in improving the
reliability and accuracy of biometric systems. This study identifies a clear
evolution from traditional gradient-based methods to more advanced machine
learning approaches. Current algorithms face persistent challenges, including
degraded image quality, damaged ridge structures, and background noise, which
impact performance. To overcome these limitations, future research must focus
on developing efficient algorithms with lower computational complexity while
maintaining robust performance across varied conditions. Hybrid methods that
combine the simplicity and efficiency of gradient-based techniques with the
adaptability and robustness of machine learning are particularly promising for
advancing fingerprint recognition systems. This study highlights the
limitations of current approaches and underscores the importance of designing
next-generation algorithms that can operate efficiently across diverse
application domains. By addressing these challenges, future developments could
enhance the scalability, reliability, and applicability of biometric systems,
paving the way for broader use in security and identification technologies.
♻ ☆ Always Skip Attention ICCV 2025
We highlight a curious empirical result within modern Vision Transformers
(ViTs). Specifically, self-attention catastrophically fails to train unless it
is used in conjunction with a skip connection. This is in contrast to other
elements of a ViT that continue to exhibit good performance (albeit suboptimal)
when skip connections are removed. Further, we show that this critical
dependence on skip connections is a relatively new phenomenon, with previous
deep architectures (e.g., CNNs) exhibiting good performance in their absence. In
this paper, we theoretically characterize that the self-attention mechanism is
fundamentally ill-conditioned and is, therefore, uniquely dependent on skip
connections for regularization. Additionally, we propose Token Graying -- a
simple yet effective complement (to skip connections) that further improves the
condition of input tokens. We validate our approach in both supervised and
self-supervised training methods.
comment: This work has just been accepted by ICCV 2025
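A minimal sketch of the setting studied above: a ViT-style block in which the skip connection around self-attention can be toggled. With use_skip=False, the paper reports that training such blocks fails; the block layout and sizes below are generic assumptions.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """ViT-style block with a toggleable skip connection around self-attention."""

    def __init__(self, dim=384, heads=6, use_skip=True):
        super().__init__()
        self.use_skip = use_skip
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, x):                                # x: (B, N, dim)
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out if self.use_skip else attn_out  # the skip connection under study
        return x + self.mlp(self.norm2(x))

block = AttentionBlock(use_skip=True)
print(block(torch.randn(2, 197, 384)).shape)  # torch.Size([2, 197, 384])
```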
♻ ☆ Enhancing Visual Reliance in Text Generation: A Bayesian Perspective on Mitigating Hallucination in Large Vision-Language Models
Large Vision-Language Models (LVLMs) usually generate text that is contextually
coherent but does not match the visual input. Such a hallucination issue
hinders LVLMs' applicability in the real world. The key to solving
hallucination in LVLM is to make the text generation rely more on the visual
content. Most previous works choose to enhance/adjust the features/output of a
specific modality (i.e., visual or textual) to alleviate hallucinations in
LVLM, which do not explicitly or systematically enhance the visual reliance. In
this paper, we comprehensively investigate the factors that may degrade the
visual reliance of LVLM text generation from a Bayesian perspective. Based
on our observations, we propose to mitigate hallucination in LVLM from three
aspects. Firstly, we observe that not all visual tokens are informative in
generating meaningful texts. We propose to evaluate and remove redundant visual
tokens to avoid their disturbance. Secondly, LVLM may encode inappropriate
prior information, making it lean toward generating unexpected words. We
propose a simple yet effective way to rectify the prior from a Bayesian
perspective. Thirdly, we observe that starting from certain steps, the
posterior of next-token prediction conditioned on visual tokens may collapse to
a prior distribution which does not depend on any informative visual tokens at
all. Thus, we propose to stop further text generation to avoid hallucination.
Extensive experiments on three benchmarks including POPE, CHAIR, and MME
demonstrate that our method can consistently mitigate the hallucination issue
of LVLM and performs favorably against previous state-of-the-art methods.
♻ ☆ Diffusion Noise Feature: Accurate and Fast Generated Image Detection ECAI 2025
Generative models now produce images with such stunning realism that they can
easily deceive the human eye. While this progress unlocks vast creative
potential, it also presents significant risks, such as the spread of
misinformation. Consequently, detecting generated images has become a critical
research challenge. However, current detection methods are often plagued by low
accuracy and poor generalization. In this paper, to address these limitations
and enhance the detection of generated images, we propose a novel
representation, Diffusion Noise Feature (DNF). Derived from the inverse process
of diffusion models, DNF effectively amplifies the subtle, high-frequency
artifacts that act as fingerprints of artificial generation. Our key insight is
that real and generated images exhibit distinct DNF signatures, providing a
robust basis for differentiation. By training a simple classifier such as
ResNet-50 on DNF, our approach achieves remarkable accuracy, robustness, and
generalization in detecting generated images, including those from unseen
generators or with novel content. Extensive experiments across four training
datasets and five test sets confirm that DNF establishes a new state-of-the-art
in generated image detection. The code is available at
https://github.com/YichiCS/Diffusion-Noise-Feature.
comment: Accepted by ECAI 2025
♻ ☆ Dataset Condensation with Color Compensation
Dataset condensation always faces a constitutive trade-off: balancing
performance and fidelity under extreme compression. Existing methods struggle
with two bottlenecks: image-level selection methods (Coreset Selection, Dataset
Quantization) suffer from inefficient condensation, while pixel-level
optimization (Dataset Distillation) introduces semantic distortion due to
over-parameterization. With empirical observations, we find that a critical
problem in dataset condensation is the oversight of color's dual role as an
information carrier and a basic semantic representation unit. We argue that
improving the colorfulness of condensed images is beneficial for representation
learning. Motivated by this, we propose DC3: a Dataset Condensation framework
with Color Compensation. After a calibrated selection strategy, DC3 utilizes
the latent diffusion model to enhance the color diversity of an image rather
than creating a brand-new one. Extensive experiments demonstrate the superior
performance and generalization of DC3 that outperforms SOTA methods across
multiple benchmarks. To the best of our knowledge, besides focusing on
downstream tasks, DC3 is the first work to fine-tune pre-trained diffusion
models with condensed datasets. The FID results prove that training networks
with our high-quality datasets is feasible without model collapse or other
degradation issues. Code and generated data are available at
https://github.com/528why/Dataset-Condensation-with-Color-Compensation.
♻ ☆ Benchmarking Federated Learning for Semantic Datasets: Federated Scene Graph Generation
Federated learning (FL) enables decentralized training while preserving data
privacy, yet existing FL benchmarks address relatively simple classification
tasks, where each sample is annotated with a one-hot label. However, little
attention has been paid to demonstrating an FL benchmark that handles
complicated semantics, where each sample encompasses diverse semantic
information, such as relations between objects. Because the existing benchmarks
are designed to distribute data in a narrow view of a single semantic, managing
the complicated semantic heterogeneity across clients when formalizing FL
benchmarks is non-trivial. In this paper, we propose a benchmark process to
establish an FL benchmark with controllable semantic heterogeneity across
clients: two key steps are (i) data clustering with semantics and (ii) data
distributing via controllable semantic heterogeneity across clients. As a proof
of concept, we construct a federated PSG benchmark, demonstrating the efficacy
of the existing PSG methods in an FL setting with controllable semantic
heterogeneity of scene graphs. We also demonstrate the effectiveness of our
benchmark by applying robust federated learning algorithms under data
heterogeneity, showing improved performance. To our knowledge, this is the
first benchmark framework that enables federated learning and its evaluation
for multi-semantic vision tasks under controlled semantic heterogeneity.
Our code is available at https://github.com/Seung-B/FL-PSG.
comment: This work has been accepted for publication in Pattern Recognition
Letters