Two papers, supported in part by the CRiTERIA project, have been accepted for publication and will be presented during workshops of the European Conference on Computer Vision (ECCV) 2022 in Tel Aviv, October 23 – 24, 2022.
The European Conference on Computer Vision (ECCV 2022) is the leading European conference in the fields of computer vision and image analysis.
The accepted papers are:
- “Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism” by Ioanna Gkartzonika, Nikolaos Gkalelis, Vasileios Mezaris (Proc. ECCV 2022 Workshop on Vision with Biased or Scarce Data (VBSD), Oct. 2022):
In this paper, two new learning-based eXplainable AI (XAI) methods for deep convolutional neural network (DCNN) image classifiers, called L-CAM-Fm and L-CAM-Img, are proposed. Both methods use an attention mechanism that is inserted into the original (frozen) DCNN and is trained to derive class activation maps (CAMs) from the last convolutional layer’s feature maps. During training, the CAMs are applied to the feature maps (L-CAM-Fm) or to the input image (L-CAM-Img), forcing the attention mechanism to learn the image regions that explain the DCNN’s outcome. Experimental evaluation on ImageNet shows that the proposed methods achieve competitive results while requiring only a single forward pass at the inference stage. Moreover, based on the derived explanations, a comprehensive qualitative analysis is performed, providing valuable insight into the reasons behind classification errors, including possible dataset biases affecting the trained classifier.
For more information about the workshop click here.
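To illustrate the general idea behind the paper, the following is a minimal NumPy sketch of how a CAM-based explanation can be formed and applied to the input image. The attention weights here are supplied as a plain array rather than learned, and the function names, shapes, and the sigmoid normalisation are our own illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l_cam_explanation(feature_maps, attn_weights):
    """Illustrative CAM computation (a sketch, not the paper's exact method):
    the class activation map is a weighted combination of the last conv
    layer's feature maps, with weights that would come from the learned
    attention mechanism, squashed to [0, 1]."""
    # feature_maps: (K, H, W); attn_weights: (K,)
    cam = np.tensordot(attn_weights, feature_maps, axes=1)  # -> (H, W)
    cam = 1.0 / (1.0 + np.exp(-cam))                        # sigmoid to [0, 1]
    return cam

def apply_cam_to_image(image, cam):
    """L-CAM-Img-style step (illustrative): upsample the CAM to image size
    and mask the input, so only attended regions reach the classifier."""
    # image: (H_img, W_img, 3); nearest-neighbour upsampling of the CAM
    reps = (image.shape[0] // cam.shape[0], image.shape[1] // cam.shape[1])
    cam_up = np.kron(cam, np.ones(reps))                    # -> (H_img, W_img)
    return image * cam_up[..., None]
```

In the L-CAM-Fm variant the same map would instead be multiplied with the feature maps themselves; in both cases the explanation is obtained in a single forward pass at inference time.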
- “Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval” by Damianos Galanopoulos and Vasileios Mezaris (Proc. ECCV 2022 Workshop on AI for Creative Video Editing and Understanding (CVEU), Oct. 2022):
In this paper, we tackle the cross-modal video retrieval problem and, more specifically, focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that generate multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations, our proposed network architecture is trained following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine textual and visual features, and document the performance of the proposed network.
For more information about the workshop click here.
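As a rough illustration of the retrieval stage described above, the sketch below sums cosine similarities over several joint spaces and then revises them with a softmax over the video axis. This is a simplified NumPy approximation under our own assumptions; the function names, shapes, and the exact form of the softmax revision are illustrative, not taken from the paper.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity between row vectors of a (N, d) and b (M, d)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T  # -> (N, M)

def multi_space_similarity(text_feats, video_feats):
    """Sum query-video similarities over multiple joint feature spaces.
    text_feats / video_feats: lists of (N, d) / (M, d) arrays, one encoded
    pair per joint space (an illustrative stand-in for the learned spaces)."""
    return sum(cosine_sim(t, v) for t, v in zip(text_feats, video_feats))

def softmax_revision(sim):
    """Revise the inferred similarities with a softmax over the video axis,
    reweighting each query's scores (a sketch of the idea; the paper's
    operation may differ in detail)."""
    e = np.exp(sim - sim.max(axis=1, keepdims=True))
    return sim * (e / e.sum(axis=1, keepdims=True))
```

The revision step sharpens each query's score distribution over the candidate videos before ranking, which is the intuition behind applying softmax operations at retrieval time.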
Download the Publications
Both publications, as well as the slides shared at the events to present the papers, are available in our publications portal.