CRiTERIA Publications

Welcome to the CRiTERIA Project website’s section featuring the project’s scientific papers and publications! 

Contact us via this form in case you have any questions about the materials available for download. 

We are on Zenodo!

View the curated CRiTERIA project publications in our Zenodo community.

Publication

Reliability Estimation of News Media Sources: Birds of a Feather Flock Together

Authors: Sergio Burdisso, Dairazalia Sánchez-Cortés, Esaú Villatoro-Tello, and Petr Motlicek | IDIAP

Evaluating the reliability of news sources is a routine task for journalists and organizations committed to acquiring and disseminating accurate information. Recent research has shown that predicting sources’ reliability represents an important first-prior step in addressing additional challenges such as fake news detection and fact-checking. In this paper, we introduce a novel approach for source reliability estimation that leverages reinforcement learning strategies for estimating the reliability degree of news sources. Contrary to previous research, our proposed approach models the problem as the estimation of a reliability degree, and not a reliability label, based on how all the news media sources interact with each other on the Web. We validated the effectiveness of our method on a news media reliability dataset that is an order of magnitude larger than comparable existing datasets. Results show that the estimated reliability degrees strongly correlate with journalist-provided scores (Spearman=0.80) and can effectively predict reliability labels (macro-avg. F1 score=81.05). We release our implementation and dataset, aiming to provide a valuable resource for the NLP community working on information verification.

2024 Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL) | June 16-21, 2024

Publication

Unveiling the silent majority: stance detection and characterization of passive users on social media using collaborative filtering and graph convolutional networks

Authors: Zhiwei Zhou and Erick Elejalde | L3S Research Center

Social Media (SM) has become a popular medium for individuals to share their opinions on various topics, including politics, social issues, and daily affairs. During controversial events such as political elections, active users often proclaim their stance and try to persuade others to support them. However, disparities in participation levels can lead to misperceptions and cause analysts to misjudge the support for each side. For example, current models usually rely on content production and overlook a vast majority of civically engaged users who passively consume information. These “silent users” can significantly impact the democratic process despite being less vocal. Accounting for the stances of this silent majority is critical to improving our reliance on SM to understand and measure social phenomena. Thus, this study proposes and evaluates a new approach for silent users’ stance prediction based on collaborative filtering and Graph Convolutional Networks, which exploits multiple relationships between users and topics. Furthermore, our method allows us to describe users with different stances and online behaviors. We demonstrate its validity using real-world datasets from two related political events. Specifically, we examine user attitudes leading to the Chilean constitutional referendums in 2020 and 2022 through extensive Twitter datasets. In both datasets, our model outperforms the baselines by over 9% at the edge- and the user level. Thus, our method offers an improvement in effectively quantifying the support and creating a multidimensional understanding of social discussions on SM platforms, especially during polarizing events.

Published in: EPJ Data Science.

Publication

T-TAME: Trainable Attention Mechanism for Explaining Convolutional Networks and Vision Transformers

Authors: Mariano V. Ntrougkas, Nikolaos Gkalelis, and Vasileios Mezaris | CERTH

The development and adoption of Vision Transformers and other deep-learning architectures for image classification tasks has been rapid. However, the “black box” nature of neural networks is a barrier to adoption in applications where explainability is essential. While some techniques for generating explanations have been proposed, primarily for Convolutional Neural Networks, adapting such techniques to the new paradigm of Vision Transformers is non-trivial. This paper presents T-TAME, Transformer-compatible Trainable Attention Mechanism for Explanations, a general methodology for explaining deep neural networks used in image classification tasks. The proposed architecture and training technique can be easily applied to any convolutional or Vision Transformer-like neural network, using a streamlined training approach. After training, explanation maps can be computed in a single forward pass; these explanation maps are comparable to or outperform the outputs of computationally expensive perturbation-based explainability techniques, achieving SOTA performance. We apply T-TAME to three popular deep learning classifier architectures, VGG-16, ResNet-50, and ViT-B-16, trained on the ImageNet dataset, and we demonstrate improvements over existing state-of-the-art explainability methods. A detailed analysis of the results and an ablation study provide insights into how the T-TAME design choices affect the quality of the generated explanation maps.

Publication

Cross-modal Networks, Fine-Tuning, Data Augmentation and Dual Softmax Operation for MediaEval NewsImages 2023

Authors: Antonios Levantakis, Damianos Galanopoulos and Vasileios Mezaris | CERTH

Matching images to articles is challenging and can be considered a special version of the cross-media retrieval problem. This notebook paper presents our solution for the MediaEval NewsImages 2023 benchmarking task. We investigate the performance of pre-trained cross-modal networks. Specifically, we investigate two pre-trained CLIP model variations and fine-tune one of them for domain adaptation. Additionally, we utilize a data augmentation technique and a method for revising the similarities produced by either one of the networks, i.e., a dual softmax operation, to improve our solutions’ performance. We report the official results for our submitted runs and additional experiments we conducted to evaluate our runs internally. We conclude that fine-tuning benefits the performance, and it is important to consider the data’s nature when selecting the appropriate pre-trained CLIP model.

MediaEval Multimedia Evaluation Workshop (MediaEval’23) | February 1-2, 2024
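The dual softmax operation mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes one common inference-time formulation, in which each text-image similarity is rescaled by a softmax prior computed over all texts competing for the same image. The function names, the `temperature` parameter, and the toy similarity values are hypothetical.

```python
import math


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    scaled = [v * temperature for v in values]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def dual_softmax(sim, temperature=1.0):
    """Revise a text-by-image similarity matrix (assumed formulation).

    sim[i][j] is the similarity between text i and image j. Each entry is
    multiplied by a prior: the softmax of column j, i.e. how strongly
    text i claims image j relative to every other text.
    """
    n_rows, n_cols = len(sim), len(sim[0])
    revised = [[0.0] * n_cols for _ in range(n_rows)]
    for j in range(n_cols):
        column = [sim[i][j] for i in range(n_rows)]
        prior = softmax(column, temperature)
        for i in range(n_rows):
            revised[i][j] = sim[i][j] * prior[i]
    return revised


# Toy example (hypothetical values): both texts initially prefer image 0.
sim = [[0.9, 0.85],
       [0.8, 0.30]]
revised = dual_softmax(sim, temperature=10.0)
for row in revised:
    print([round(v, 3) for v in row])
```

In this toy case, image 1 is strongly claimed only by text 0, so after the revision the top match of text 0 moves from image 0 to image 1, while text 1 keeps image 0. This is the kind of re-ranking effect such a revision step is meant to produce.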

Publication

Exploring Multi-Modal Fusion for Image Manipulation Detection and Localization

Authors: Konstantinos Triaridis and Vasileios Mezaris | CERTH

Recent image manipulation localization and detection techniques usually leverage forensic artifacts and traces that are produced by a noise-sensitive filter, such as SRM and Bayar convolution. In this paper, we showcase that different filters commonly used in such approaches excel at unveiling different types of manipulations and provide complementary forensic traces. Thus, we explore ways of merging the outputs of such filters and aim to leverage the complementary nature of the artifacts produced to perform image manipulation localization and detection (IMLD). We propose two distinct methods: one that produces independent features from each forensic filter and then fuses them (this is referred to as late fusion) and one that performs early mixing of different modal outputs and produces early combined features (this is referred to as early fusion). We demonstrate that both approaches achieve competitive performance for both image manipulation localization and detection, outperforming state-of-the-art models across several datasets.

International Conference on Multimedia Modeling (MMM) | January 29 – February 2, 2024

Publication

Masked Feature Modelling for the unsupervised pre-training of a Graph Attention Network block for bottom-up video event recognition

Authors: Dimitrios Daskalakis, Nikolaos Gkalelis, and Vasileios Mezaris | CERTH

In this paper, we introduce Masked Feature Modelling (MFM), a novel approach for the unsupervised pretraining of a Graph Attention Network (GAT) block. MFM utilizes a pretrained Visual Tokenizer to reconstruct masked features of objects within a video, leveraging the MiniKinetics dataset. We then incorporate the pre-trained GAT block into a state-of-the-art bottom-up supervised video-event recognition architecture, ViGAT, to improve the model’s starting point and overall accuracy. Experimental evaluations on the YLI-MED dataset demonstrate the effectiveness of MFM in improving event recognition performance.

IEEE International Symposium on Multimedia (ISM 2023) | December 11-13, 2023

Publication

ITI-CERTH participation in AVS Task of TRECVID 2023

Authors: Damianos Galanopoulos and Vasileios Mezaris | CERTH

This report presents an overview of the runs submitted to Ad-hoc Video Search (AVS) on behalf of the ITI-CERTH team. Our participation in the AVS task is based on a transformer-based extension of a cross-modal deep network architecture. We analyzed visual information at multiple levels of granularity using detected objects. During the retrieval stage, we employed a dual-softmax approach to adjust the calculated text-video similarities.

TRECVID 2023 | November, 2023

Publication

Bridging Qualitative Data Silos: The Potential of Reusing Codings Through Machine Learning Based Cross-Study Code Linking

Authors: Sergej Wildemann, Claudia Niederée, and Erick Elejalde | L3S Research Center

For qualitative data analysis (QDA), researchers assign codes to text segments to arrange the information into topics or concepts. These annotations facilitate information retrieval and the identification of emerging patterns in unstructured data. However, this metadata is typically not published or reused after the research. Subsequent studies with similar research questions require a new definition of codes and do not benefit from other analysts’ experience. Machine learning (ML) based classification seeded with such data remains a challenging task due to the ambiguity of code definitions and the inherent subjectivity of the exercise. Previous attempts to support QDA using ML rely on linear models and only examined individual datasets that were either smaller or coded specifically for this purpose. However, we show that modern approaches effectively capture at least part of the codes’ semantics and may generalize to multiple studies. We analyze the performance of multiple classifiers across three large real-world datasets. Furthermore, we propose an ML-based approach to identify semantic relations of codes in different studies to show thematic faceting, enhance retrieval of related content, or bootstrap the coding process. These are encouraging results that suggest how analysts might benefit from prior interpretation efforts, potentially yielding new insights into qualitative data.

Social Science Computer Review | November 13, 2023

ABSTRACT

Chile’s Internal Migration Dynamics during the COVID-19 Pandemic

Authors: Erick Elejalde, Victor Navarro, Loreto Bravo, Leo Ferres, and Emilio Zagheni

The “Chile’s Internal Migration Dynamics during the COVID-19 Pandemic” talk was given at the 9th International Conference on Computational Social Science (IC2S2), July 17-20, 2023, Copenhagen, Denmark.

Publication

Migration Reframed? A multilingual analysis on the stance shift in Europe during the Ukrainian crisis

Authors: Sergej Wildemann, Claudia Niederée, and Erick Elejalde | L3S Research Center

The war in Ukraine seems to have positively changed the attitude toward the critical societal topic of migration in Europe — at least towards refugees from Ukraine. We investigate whether this impression is substantiated by how the topic is reflected in online news and social media, thus linking the representation of the issue on the Web to its perception in society. For this purpose, we combine and adapt leading-edge automatic text processing for a novel multilingual stance detection approach. Starting from 5.5M Twitter posts published by 565 European news outlets in one year, beginning September 2021, plus replies, we perform a multilingual analysis of migration-related media coverage and associated social media interaction for Europe and selected European countries.

The results of our analysis show that there is actually a reframing of the discussion illustrated by the terminology change, e.g., from “migrant” to “refugee”, often even accentuated with phrases such as “real refugees”. However, concerning a stance shift in public perception, the picture is more diverse than expected. All analyzed cases show a noticeable temporal stance shift around the start of the war in Ukraine. Still, there are apparent national differences in the size and stability of this shift.

This paper is published in The Web Conference 2023 | April 30 – May 4, 2023

Publication

Stance Inference in Twitter through Graph Convolutional Collaborative Filtering Networks with Minimal Supervision

Authors: Zhiwei Zhou and Erick Elejalde | L3S Research Center

Social Media (SM) has become a stage for people to share thoughts, emotions, opinions, and almost every other aspect of their daily lives. This abundance of human interaction makes SM particularly attractive for social sensing. Especially during polarizing events such as political elections or referendums, users post information and encourage others to support their side, using symbols such as hashtags to represent their attitudes. However, many users choose not to attach hashtags to their messages, use a different language, or show their position only indirectly. Thus, automatically identifying their opinions becomes a more challenging task. To uncover these implicit perspectives, we propose a collaborative filtering model based on Graph Convolutional Networks that exploits the textual content in messages and the rich connections between users and topics. Moreover, our approach only requires a small annotation effort compared to state-of-the-art solutions. Nevertheless, the proposed model achieves competitive performance in predicting individuals’ stances. We analyze users’ attitudes ahead of two constitutional referendums in Chile in 2020 and 2022. Using two large Twitter datasets, our model achieves improvements of 3.4% in recall and 3.6% in accuracy over the baselines.

This paper is published in The Web Conference 2023 | April 30 – May 4, 2023

Publication

Learning Faithful Attention for Interpretable Classification of Crisis-Related Microblogs under Constrained Human Budget

Authors: Thi Huyen Nguyen and Koustav Rudra

The recent widespread use of social media platforms has created convenient ways to obtain and spread up-to-date information during crisis events such as disasters. Time-critical analysis of crisis data can help human organizations gain actionable information and plan for aid responses. Many existing studies have proposed methods to identify informative messages and categorize them into different humanitarian classes. Advanced neural network architectures tend to achieve state-of-the-art performance, but the model decisions are opaque. While attention heatmaps show insights into the model’s prediction, some studies found that standard attention does not provide meaningful explanations. Alternatively, recent works proposed interpretable approaches for the classification of crisis events that rely on human rationales to train and extract short snippets as explanations. However, the rationale annotations are not always available, especially in real-time situations for new tasks and events. In this paper, we propose a two-stage approach to learn the rationales under minimal human supervision and derive faithful machine attention. Extensive experiments over four crisis events show that our model is able to obtain better or comparable classification performance (~86% Macro-F1) to baselines and faithful attention heatmaps using only 40-50% human-level supervision. Further, we employ a zero-shot learning setup to detect actionable tweets along with actionable word snippets as rationales.

This paper is published in The Web Conference 2023 | April 30 – May 4, 2023

Publication

Claim-Dissector: An Interpretable Fact-Checking System with Joint Re-ranking and Veracity Prediction

Authors: Martin Fajcik, Petr Motlicek, and Pavel Smrz

We present Claim-Dissector: a novel latent variable model for fact-checking and analysis, which, given a claim and a set of retrieved evidence, jointly learns to identify (i) the evidence relevant to the given claim and (ii) the veracity of the claim. We propose to disentangle the per-evidence relevance probability and its contribution to the final veracity probability in an interpretable way: the final veracity probability is proportional to a linear ensemble of per-evidence relevance probabilities. In this way, the individual contributions of evidences towards the final predicted probability can be identified. In per-evidence relevance probability, our model can further distinguish whether each relevant evidence is supporting (S) or refuting (R) the claim. This makes it possible to quantify how much the S/R probability contributes to the final verdict or to detect disagreeing evidence.

Despite its interpretable nature, our system achieves results competitive with state-of-the-art on the FEVER dataset, as compared to typical two-stage system pipelines, while using significantly fewer parameters. Furthermore, our analysis shows that our model can learn fine-grained relevance cues while using coarse-grained supervision, and we demonstrate it in two ways. (i) We show that our model can achieve competitive sentence recall while using only paragraph-level relevance supervision. (ii) Traversing towards the finest granularity of relevance, we show that our model is capable of identifying relevance at the token level. To do this, we present a new benchmark, TLR-FEVER, focusing on token-level interpretability: humans annotate tokens in relevant evidences they considered essential when making their judgment. Then we measure how similar these annotations are to the tokens our model is focusing on.

Findings of the Association for Computational Linguistics: ACL 2023, July 9-14, 2023

ABSTRACT

Impact of COVID-19 on Chile’s Internal Migration

Authors: Erick Elejalde, Victor Navarro, Loreto Bravo, and Leo Ferres

The “Impact of COVID-19 on Chile’s Internal Migration” paper was submitted to the NetSci-X 2023 conference. The full paper will be available soon.

ABSTRACT

Migration Reframed? Multilingual analysis on the stance shift in Europe during the Ukrainian crisis

Authors: Sergej Wildemann and Erick Elejalde | L3S Research Center

The “Migration Reframed? Multilingual analysis on the stance shift in Europe during the Ukrainian crisis” paper was presented at the NetSci-X 2023 conference.

Publication

Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022

Authors: Damianos Galanopoulos and Vasileios Mezaris | CERTH-ITI

Matching images to articles is challenging and can be considered a special version of the cross-media retrieval problem. This working note paper presents our solution for the MediaEval NewsImages benchmarking task. We investigated the performance of two cross-modal networks, a pre-trained network and a trainable one, the latter originally developed for text-video retrieval tasks and adapted to the NewsImages task. Moreover, we utilize a method for revising the similarities produced by either one of the cross-modal networks, i.e., a dual softmax operation, to improve our solutions’ performance. We report the official results for our submitted runs and additional experiments we conducted to evaluate our runs internally.

Multimedia Evaluation Workshop (MediaEval’22), January 12-13, 2023, Bergen, Norway.

SLIDES

Cross-modal Networks and Dual Softmax Operation for MediaEval NewsImages 2022

Authors: Damianos Galanopoulos and Vasileios Mezaris | CERTH-ITI

The following slides were part of the paper presentation at the Multimedia Evaluation Workshop (MediaEval’22), January 12-13, 2023, Bergen, Norway.

Publication

VERGE in VBS 2023

Authors: Nick Pantelidis, Stelios Andreadis, Maria Pegia, Anastasia Moumtzidou, Damianos Galanopoulos, Konstantinos Apostolidis, Despoina Touska, Konstantinos Gkountakos, Ilias Gialampoukidis, Stefanos Vrochidis, Vasileios Mezaris, and Ioannis Kompatsiaris | CERTH-ITI

This paper describes VERGE, an interactive video retrieval system for browsing a collection of images from videos and searching for specific content. The system utilizes many retrieval techniques as well as fusion and reranking capabilities. A Web Application is also part of VERGE, where a user can create queries, view the top results and submit the appropriate data, all in a user-friendly way.

International Conference on Multimedia Modeling (MMM2023), January 9-12, 2023, Bergen, Norway.

Publication

Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism

Authors: Nikolaos Gkalelis, Dimitrios Daskalakis, and Vasileios Mezaris | CERTH-ITI

In this paper, Gated-ViGAT, an efficient approach for video event recognition utilizing bottom-up (object) information, a new frame sampling policy, and a gating mechanism, is proposed. Specifically, the frame sampling policy uses weighted in-degrees (WiDs), derived from the adjacency matrices of graph attention networks (GATs), and a dissimilarity measure to select the most salient and at the same time diverse frames representing the event in the video. Additionally, the proposed gating mechanism fetches the selected frames sequentially and exits early when an adequately confident decision is achieved. In this way, only a few frames are processed by the computationally expensive branch of our network that is responsible for the bottom-up information extraction. The experimental evaluation on two large, publicly available video datasets (MiniKinetics, ActivityNet) demonstrates that Gated-ViGAT provides a large computational complexity reduction in comparison to our previous approach (ViGAT), while maintaining the excellent event recognition and explainability performance.

IEEE International Symposium on Multimedia 2022, December 2022, in Naples, Italy.
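The frame sampling idea described in the abstract above (weighted in-degrees combined with a dissimilarity measure) can be sketched roughly as follows. This is an illustrative sketch only, not the Gated-ViGAT implementation: it assumes the weighted in-degree of a frame is the column sum of the attention adjacency matrix, uses cosine dissimilarity, and combines the two with a simple greedy product score. The function names, the scoring rule, and the toy data are all hypothetical.

```python
import math


def weighted_in_degrees(adjacency):
    """Column sums of a square adjacency matrix.

    Assumption: adjacency[i][j] is the attention weight from frame i to
    frame j, so a large column sum marks a frame many others attend to.
    """
    n = len(adjacency)
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def cosine_dissimilarity(a, b):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def select_frames(adjacency, features, k):
    """Greedily pick up to k frames that are salient and mutually diverse."""
    wids = weighted_in_degrees(adjacency)
    n = len(wids)
    # Start from the most salient frame.
    selected = [max(range(n), key=lambda idx: wids[idx])]
    while len(selected) < min(k, n):
        best, best_score = None, -1.0
        for idx in range(n):
            if idx in selected:
                continue
            # Diversity: distance to the closest already-selected frame.
            diversity = min(
                cosine_dissimilarity(features[idx], features[s])
                for s in selected
            )
            score = wids[idx] * diversity  # hypothetical combination rule
            if score > best_score:
                best, best_score = idx, score
        selected.append(best)
    return selected


# Toy data (hypothetical): three frames with 2-D features.
adjacency = [[0.2, 0.5, 0.3],
             [0.1, 0.6, 0.3],
             [0.1, 0.7, 0.2]]
features = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
print(select_frames(adjacency, features, k=2))
```

In the toy data, frame 1 has the highest weighted in-degree and is picked first; frame 0 is nearly identical to it in feature space, so the more distinct frame 2 is picked second. A gating mechanism as described in the abstract would then process the selected frames in this order and stop once the classifier is confident enough.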

SLIDES

Gated-ViGAT: Efficient Bottom-Up Event Recognition and Explanation Using a New Frame Selection Policy and Gating Mechanism

Authors: Nikolaos Gkalelis, Dimitrios Daskalakis, and Vasileios Mezaris | CERTH-ITI

The following slides were part of the paper presentation at the IEEE International Symposium on Multimedia 2022, December 2022, in Naples, Italy.

Publication

TAME: Attention Mechanism Based Feature Fusion for Generating Explanation Maps of Convolutional Neural Networks

Authors: Mariano Ntrougkas, Nikolaos Gkalelis, and Vasileios Mezaris | CERTH-ITI

The apparent “black box” nature of neural networks is a barrier to adoption in applications where explainability is essential. This paper presents TAME (Trainable Attention Mechanism for Explanations), a method for generating explanation maps with a multi-branch hierarchical attention mechanism. TAME combines a target model’s feature maps from multiple layers using an attention mechanism, transforming them into an explanation map. TAME can easily be applied to any convolutional neural network (CNN) by streamlining the optimization of the attention mechanism’s training method and the selection of target model’s feature maps. After training, explanation maps can be computed in a single forward pass. We apply TAME to two widely used models, i.e. VGG-16 and ResNet-50, trained on ImageNet and show improvements over previous top-performing methods. We also provide a comprehensive ablation study comparing the performance of different variations of TAME’s architecture.

IEEE International Symposium on Multimedia 2022, December 2022, in Naples, Italy.

SLIDES

TAME: Attention Mechanism Based Feature Fusion for Generating Explanation Maps of Convolutional Neural Networks

Authors: Mariano Ntrougkas, Nikolaos Gkalelis, and Vasileios Mezaris | CERTH-ITI

The following slides were part of the “TAME: Attention Mechanism Based Feature Fusion for Generating Explanation Maps of Convolutional Neural Networks” paper presentation delivered at the IEEE International Symposium on Multimedia 2022, December 2022, in Naples, Italy.

Publication

ITI-CERTH participation in ActEV and AVS Tracks of TRECVID 2022

Authors: Konstantinos Gkountakos, Damianos Galanopoulos, Despoina Touska, Konstantinos Ioannidis, Stefanos Vrochidis, Vasileios Mezaris, and Ioannis Kompatsiaris | CERTH-ITI

This report presents the overview of the runs related to Ad-hoc Video Search (AVS) and Activities in Extended Video (ActEV) tasks on behalf of the ITI-CERTH team. Our participation in the AVS task is based on a cross-modal deep network architecture utilizing several textual and visual features. As part of the retrieval stage, a dual-softmax approach is utilized to revise the calculated text-video similarities. For the ActEV task, we adapt our framework to fit the new dataset and overcome the challenges of detecting and recognizing activities in a multi-label manner while experimenting with two separate activity classifiers.

Proc. TRECVID 2022 Workshop, December, 2022

SLIDES

Explaining the Decisions of Image/Video Classifiers

Author: Vasileios Mezaris | CERTH-ITI

The following slides were presented at the 1st Nice Workshop on Interpretability, November 17-18, 2022, Université Côte d’Azur, Nice, France.

Publication

L3S at TREC 2022 CrisisFACTS track

Authors: Thi Huyen Nguyen and Koustav Rudra

This paper describes our proposed approach for the multi-stream summarization of the crisis-related events in the TREC 2022 CrisisFACTS track. We apply a retrieval and ranking-based two-step summarization approach. First, we employ a sparse retrieval framework where content texts from multiple online streams are treated as a document corpus, and a term matching-based retrieval strategy is used to retrieve relevant contents, so-called facts, to the set of queries in a given event day. Next, we use several pre-trained models to measure semantic similarity between query-fact or fact-fact pairs, score and rank the facts for the extraction of daily event summaries.

TREC2022: 31st Text Retrieval Conference (TREC) | November 14-18, 2022

Publication

CrisICSum: Interpretable Classification and Summarization Platform for Crisis Events from Microblogs

Authors: Thi Huyen Nguyen, Miroslav Shaltev, and Koustav Rudra

Microblogging platforms such as Twitter receive massive volumes of messages during crisis events. Real-time insights are crucial for emergency response. Hence, there is a need to develop faithful tools for efficiently digesting information. In this paper, we present CrisICSum, a platform for classification and summarization of crisis events. The objective of CrisICSum is to classify user posts during disaster events into different humanitarian classes (e.g., damage, affected people) and generate summaries of class-level messages. Unlike existing systems, CrisICSum employs an interpretable-by-design backend classifier. It can generate explanations for output decisions. Besides, the platform allows user feedback on both classification and summarization phases. CrisICSum is designed and run as an easily integrated web application. Backend models are interchangeable. The system can assist users and human organizations in improving response efforts during disaster situations. CrisICSum is available at https://crisicsum.l3s.uni-hannover.de

CIKM’22: Proc. of the 31st ACM International Conference on Information & Knowledge Management | October 17-21, 2022

Publication

Rationale Aware Contrastive Learning Based Approach to Classify and Summarize Crisis-Related Microblogs

Authors: Thi Huyen Nguyen and Koustav Rudra

The way information now propagates on Twitter makes the platform a crucial conduit for tactical data and emergency responses during disasters. However, the real-time information about crises is immersed in a large volume of emotional and irrelevant posts. This creates the need for an automatic tool to identify disaster-related messages and summarize the information for data consumption and situation planning. Besides, explainability of the methods is crucial in determining their applicability in real-life scenarios. Recent studies also highlight the importance of learning a good latent representation of tweets for several downstream tasks. In this paper, we take advantage of state-of-the-art methods, such as transformers and contrastive learning, to build an interpretable classifier. Our proposed model classifies Twitter messages into different humanitarian categories and also extracts rationale snippets as supporting evidence for output decisions. The contrastive learning framework helps to learn better representations of tweets by bringing the related tweets closer in the embedding space. Furthermore, we employ classification labels and rationales to efficiently generate summaries of crisis events. Extensive experiments over different crisis datasets show that (i) our classifier obtains the best performance-interpretability trade-off and (ii) the proposed summarizer shows superior performance (1.4%-22% improvement) with significantly less computation cost than baseline models.

CIKM’22: Proc. of the 31st ACM International Conference on Information & Knowledge Management | October 17-21, 2022

SLIDES

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval – Slides

Authors: Damianos Galanopoulos and Vasileios Mezaris | CERTH-ITI

The following slides were presented at the ECCV 2022 Workshop on AI for Creative Video Editing and Understanding (CVEU) in October 2022 to discuss the “Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval” paper.

Publication

Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

Authors: Damianos Galanopoulos and Vasileios Mezaris | CERTH-ITI

In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that lead to generating multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations our proposed network architecture is trained by following a multiple-space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network.

ECCV 2022 Workshop on AI for Creative Video Editing and Understanding (CVEU) | October 16, 2022

SLIDES

Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism – Slides

Authors: Ioanna Gkartzonika, Nikolaos Gkalelis, and Vasileios Mezaris | CERTH-ITI

The following slides were presented at the ECCV 2022 Workshop on Vision with Biased or Scarce Data (VBSD) in October 2022 to discuss the “Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism” paper.

Publication

Learning Visual Explanations for DCNN-Based Image Classifiers Using an Attention Mechanism

Authors: Ioanna Gkartzonika, Nikolaos Gkalelis, and Vasileios Mezaris | CERTH-ITI

In this paper, two new learning-based eXplainable AI (XAI) methods for deep convolutional neural network (DCNN) image classifiers, called L-CAM-Fm and L-CAM-Img, are proposed. Both methods use an attention mechanism that is inserted in the original (frozen) DCNN and is trained to derive class activation maps (CAMs) from the last convolutional layer’s feature maps. During training, CAMs are applied to the feature maps (L-CAM-Fm) or the input image (L-CAM-Img), forcing the attention mechanism to learn the image regions explaining the DCNN’s outcome. Experimental evaluation on ImageNet shows that the proposed methods achieve competitive results while requiring a single forward pass at the inference stage. Moreover, based on the derived explanations, a comprehensive qualitative analysis is performed, providing valuable insight into the reasons behind classification errors, including possible dataset biases affecting the trained classifier.

ECCV 2022 Workshop on Vision with Biased or Scarce Data (VBSD) | October 24, 2022

Publication

IDIAPers @ Causal News Corpus 2022: Extracting Cause-Effect-Signal Triplets via Pre-trained Autoregressive Language Model

Authors: Martin Fajcik, Muskaan Singh, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Sergio Burdisso, Petr Motlicek, and Pavel Smrz

In this paper, we describe our shared task submissions for Subtask 2 in CASE-2022, Event Causality Identification with Causal News Corpus. The challenge focused on the automatic detection of all cause-effect-signal spans present in a sentence from news media. We detect cause-effect-signal spans in a sentence using T5, a pre-trained autoregressive language model. We iteratively identify all cause-effect-signal span triplets, always conditioning the prediction of the next triplet on the previously predicted ones. To predict the triplet itself, we consider different causal relationships such as cause→effect→signal. Each triplet component is generated via a language model conditioned on the sentence, the previous parts of the current triplet, and previously predicted triplets. Despite training on an extremely small dataset of 160 samples, our approach achieved competitive performance, being placed second in the competition. Furthermore, we show that assuming either cause→effect or effect→cause order achieves similar results.

CASE@EMNLP 2022: 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text | December 7-8, 2022
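
The iterative decoding loop described in the abstract above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable stands in for the language model, and its signature, the stopping rule (an empty cause or effect), the `max_triplets` cap, and the toy generator are all assumptions.

```python
from typing import Callable, Dict, List, Sequence

Triplet = Dict[str, str]
# Hypothetical model interface:
# (sentence, component name, partial current triplet, previous triplets) -> span text
Generator = Callable[[str, str, Triplet, List[Triplet]], str]


def extract_triplets(
    sentence: str,
    generate: Generator,
    order: Sequence[str] = ("cause", "effect", "signal"),
    max_triplets: int = 4,
) -> List[Triplet]:
    """Predict triplets one at a time, conditioning on everything predicted so far."""
    triplets: List[Triplet] = []
    for _ in range(max_triplets):
        current: Triplet = {}
        for part in order:
            # Each component sees the sentence, the earlier parts of the
            # current triplet, and all previously predicted triplets.
            current[part] = generate(sentence, part, dict(current), list(triplets))
        if not current["cause"] or not current["effect"]:
            break  # assumed stop signal: the model produced nothing further
        triplets.append(current)
    return triplets


def toy_generate(sentence, part, partial, previous):
    """Stand-in for a language model: returns one canned triplet, then nothing."""
    canned = [{"cause": "heavy rain",
               "effect": "The match was cancelled",
               "signal": "because of"}]
    index = len(previous)
    return canned[index][part] if index < len(canned) else ""


result = extract_triplets("The match was cancelled because of heavy rain.", toy_generate)
```

Changing `order` to `("effect", "cause", "signal")` switches the assumed generation order without touching the rest of the loop.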

Publication

IDIAPers @ Causal News Corpus 2022: Efficient Causal Relation Identification Through a Prompt-based Few-shot Approach

Authors: Sergio Burdisso, Juan Zuluaga-Gomez, Esaú Villatoro-Tello, Martin Fajcik, Muskaan Singh, Pavel Smrz, and Petr Motlicek

In this paper, we describe our participation in Subtask 1 of CASE-2022, Event Causality Identification with Causal News Corpus. We address the Causal Relation Identification (CRI) task by exploiting a set of simple yet complementary techniques for fine-tuning language models (LMs) on a small number of annotated examples (i.e., a few-shot configuration). We follow a prompt-based prediction approach for fine-tuning LMs in which the CRI task is treated as a masked language modeling (MLM) problem. This approach allows LMs natively pre-trained on MLM problems to directly generate textual responses to CRI-specific prompts. We compare the performance of this method against ensemble techniques trained on the entire dataset. Our best-performing submission was fine-tuned with only 256 instances per class, 15.7% of all available data, and yet obtained the second-best precision (0.82), third-best accuracy (0.82), and an F1-score (0.85) very close to that reported by the winning team (0.86).

CASE@EMNLP 2022: 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text | December 7-8, 2022
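
To make the prompt-based formulation in the abstract above more concrete, the sketch below casts a binary causal-relation decision as a cloze task: the sentence is wrapped in a template containing a mask token, and the label is read off from whichever label word scores highest at the masked position. The template wording, the `[MASK]` token string, the "yes"/"no" label words, and the score dictionary are hypothetical choices for illustration; the abstract does not state the actual prompts used.

```python
MASK = "[MASK]"  # assumed mask token; the real string depends on the tokenizer

# Hypothetical verbalizer: maps label words predicted at the mask to class ids.
VERBALIZER = {"yes": 1, "no": 0}


def build_prompt(sentence, mask_token=MASK):
    """Wrap a sentence in a cloze-style template with a single masked slot."""
    return f"{sentence} Does this sentence express a causal relation? {mask_token}."


def predict_label(mask_scores, verbalizer=VERBALIZER):
    """Pick the class whose label word has the highest score at the mask.

    mask_scores maps candidate words to the scores a masked language model
    would assign at the masked position; words outside the verbalizer are ignored.
    """
    best_word = max(verbalizer, key=lambda w: mask_scores.get(w, float("-inf")))
    return verbalizer[best_word]


prompt = build_prompt("The flood was caused by heavy rain.")
# Made-up scores standing in for a model's output at the masked position.
label = predict_label({"yes": 2.1, "no": 0.3, "maybe": 5.0})
```

Because the classifier reuses the model's own masked-word prediction, no new classification head has to be trained from scratch, which is one reason this style of prompting suits few-shot settings.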

Publication

ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network

Authors: Nikolaos Gkalelis, Dimitrios Daskalakis, and Vasileios Mezaris | CERTH-ITI

In this paper, a pure-attention bottom-up approach, called ViGAT, is proposed for the task of event recognition and explanation in video. ViGAT utilizes an object detector together with a Vision Transformer (ViT) backbone network to derive object and frame features, and a head network to process these features. The ViGAT head consists of graph attention network (GAT) blocks factorized along the spatial and temporal dimensions in order to effectively capture both local and long-term dependencies between objects or frames. Moreover, using the weighted in-degrees (WiDs) derived from the adjacency matrices at the various GAT blocks, we show that the proposed architecture can identify the most salient objects and frames that explain the decision of the network. A comprehensive evaluation study is performed, demonstrating that the proposed approach provides state-of-the-art results on three large, publicly available video datasets (FCVID, MiniKinetics, ActivityNet). The source code is made publicly available at: https://github.com/bmezaris/ViGAT

IEEE Access | October 2022
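
The weighted in-degree idea from the abstract above can be illustrated with a small sketch. It assumes that `adjacency[i][j]` holds the attention weight of the edge from node i to node j, so a node's WiD is the sum of its column; the paper's exact convention may differ, and the function names and the toy matrix are made up for this example.

```python
def weighted_in_degrees(adjacency):
    """Sum the incoming edge weights of each node (column sums)."""
    n = len(adjacency)
    return [sum(adjacency[i][j] for i in range(n)) for j in range(n)]


def most_salient(adjacency, k=1):
    """Return the indices of the k nodes with the largest weighted in-degree."""
    wids = weighted_in_degrees(adjacency)
    return sorted(range(len(wids)), key=lambda j: wids[j], reverse=True)[:k]


# Toy attention matrix over three nodes (objects or frames); rows sum to 1.
attention = [
    [0.1, 0.7, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
]
wids = weighted_in_degrees(attention)
top_two = most_salient(attention, k=2)
```

In this toy matrix, node 1 receives the most attention from the other nodes, so it would be reported as the most salient object or frame.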