
Vision-Language Models (VLMs) are increasingly deployed in practical applications, yet they remain vulnerable to adversarial perturbations. Existing adversarial fine-tuning methods typically rely on one-to-one image-text supervision, which can overfit to narrow language cues. This work studies multimodal defense with one-to-many relationships between images and textual descriptions, pairing each image with multiple captions during adversarial fine-tuning; this improves robustness under stronger attack settings while retaining clean performance.
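To make the one-to-many idea concrete, below is a minimal PyTorch sketch of one adversarial fine-tuning step under assumptions of mine, not the paper's implementation: a CLIP-style dual encoder (reduced here to toy linear/embedding stand-ins), an L-infinity PGD inner attack, and a contrastive loss whose soft targets spread probability uniformly over the k captions paired with each image. All names, shapes, and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins for a CLIP-style dual encoder; a real defense would
# fine-tune a pretrained VLM instead. Purely illustrative.
image_encoder = torch.nn.Linear(3 * 32 * 32, 128)
text_encoder = torch.nn.Embedding(1000, 128)

def embed_images(x):
    return F.normalize(image_encoder(x.flatten(1)), dim=-1)

def embed_texts(t):
    # Mean-pool token embeddings into one vector per caption.
    return F.normalize(text_encoder(t).mean(1), dim=-1)

def one_to_many_loss(img_emb, txt_emb, k, tau=0.07):
    """Contrastive loss with k positive captions per image.

    txt_emb has shape (B*k, d); caption j belongs to image j // k,
    and the target distribution is uniform over each image's k positives.
    """
    logits = img_emb @ txt_emb.t() / tau              # (B, B*k)
    b = img_emb.size(0)
    targets = torch.zeros(b, b * k, device=logits.device)
    for i in range(b):
        targets[i, i * k:(i + 1) * k] = 1.0 / k
    return -(targets * F.log_softmax(logits, dim=-1)).sum(-1).mean()

def pgd_images(x, t, k, eps=8 / 255, alpha=2 / 255, steps=3):
    """Craft L_inf-bounded adversarial images that maximize the loss."""
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = one_to_many_loss(embed_images(x + delta), embed_texts(t), k)
        grad, = torch.autograd.grad(loss, delta)
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps)
        delta = delta.detach().requires_grad_(True)
    return (x + delta).detach()

# One toy fine-tuning step: B images, k captions each (random data here).
B, k = 4, 3
images = torch.rand(B, 3, 32, 32)
captions = torch.randint(0, 1000, (B * k, 16))        # 16 token ids per caption
opt = torch.optim.SGD(
    list(image_encoder.parameters()) + list(text_encoder.parameters()), lr=1e-3
)

adv = pgd_images(images, captions, k)                 # inner maximization
loss = one_to_many_loss(embed_images(adv), embed_texts(captions), k)
opt.zero_grad()
loss.backward()                                       # outer minimization
opt.step()
```

The soft-target construction is where the one-to-many supervision enters: with k = 1 the loss reduces to the standard one-to-one contrastive objective, whereas k > 1 forces robust image features to align with several phrasings at once rather than memorizing a single caption.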