Selected work
ConViS-Bench: Estimating Video Similarity Through Semantic Concepts
ImageNet-trained CNNs are not biased towards texture: Revisiting feature reliance through controlled suppression
Increasing the Utility of Synthetic Images through Chamfer Guidance
Towards a General Attention Framework on Gyrovector Spaces for Matrix Manifolds
AlignCAT: Visual-Linguistic Alignment of Category and Attribute for Weakly Supervised Visual Grounding
Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection
FreeInsert: Disentangled Text-Guided Object Insertion in 3D Gaussian Scene without Spatial Priors
FedMVP: Federated Multimodal Visual Prompt Tuning for Vision-Language Models
Generate, Refine, and Encode: Leveraging Synthesized Novel Samples for On-the-Fly Fine-Grained Category Discovery
Hierarchical Visual Prompt Learning for Continual Video Instance Segmentation
On Large Multimodal Models as Open-World Image Classifiers
Superpowering Open-Vocabulary Object Detectors for X-ray Vision
What Changed? Detecting and Evaluating Instruction-Guided Image Edits with Multimodal Large Language Models
Automatic benchmarking of large multimodal models via iterative experiment programming
Diversified in-domain synthesis with efficient fine-tuning for few-shot classification
Classifier-to-Bias: Toward Unsupervised Automatic Bias Detection for Visual Classifiers
Compositional Caching for Training-free Open-vocabulary Attribute Detection
Multi-focal Conditioned Latent Diffusion for Person Image Synthesis
Not Only Text: Exploring Compositionality of Visual Representations in Vision-Language Models