Video Event Detection

Video is a medium that is rich in information and has a strong temporal dimension: it typically depicts events that unfold in time. For this reason, recognizing the events depicted in video is a major challenge towards extracting useful information from video content. Deep learning techniques have achieved major performance leaps in video event recognition, and new improvements in this domain continue to push the recognition performance limits every year. Typical video event recognition methods operate in a top-down fashion, i.e., a neural network is trained using the video class labels and entire frames (or video segments) to implicitly learn to focus on the video regions that are most related to the occurring event. The drawbacks of this general methodology are that the recognition methods lack a deep understanding of the video being processed, which limits their recognition accuracy, and that they operate as black boxes, providing no explanations for their recognition decisions.

In CRiTERIA, we develop methods for video event recognition that address the above limitations. We propose a general bottom-up event recognition methodology: the event classifier is supported by first automatically detecting and recognizing the objects depicted in each frame. The event classifier then builds on top of these object detection results, modeling the events by considering not only the global scene but also the scene's constituent objects and their interactions.
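To make the idea more concrete, the following is a minimal, hypothetical sketch (in PyTorch-style Python) of such a bottom-up classifier: per-frame object features are pooled and fused with a global scene feature before the event is predicted. All class and variable names are illustrative and do not correspond to the released ViGAT code.

```python
# Hypothetical sketch of the bottom-up idea: detect objects in every sampled frame,
# then let the event classifier reason jointly over the global scene and the
# per-object evidence. Names and dimensions are illustrative only.
import torch
import torch.nn as nn

class BottomUpEventClassifier(nn.Module):
    def __init__(self, feat_dim: int = 768, num_events: int = 400):
        super().__init__()
        # Fuses a global scene feature with pooled object features per frame.
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)
        self.head = nn.Linear(feat_dim, num_events)

    def forward(self, scene_feats, object_feats):
        # scene_feats:  (frames, feat_dim)          - one global descriptor per frame
        # object_feats: (frames, objects, feat_dim) - one descriptor per detected object
        pooled_objects = object_feats.mean(dim=1)             # aggregate object evidence
        per_frame = torch.relu(
            self.fuse(torch.cat([scene_feats, pooled_objects], dim=-1)))
        video_repr = per_frame.mean(dim=0)                    # temporal pooling over frames
        return self.head(video_repr)                          # event logits

# Toy usage with random features standing in for ViT outputs.
scene = torch.randn(9, 768)        # 9 sampled frames
objects = torch.randn(9, 10, 768)  # 10 detected objects per frame
logits = BottomUpEventClassifier()(scene, objects)
print(logits.shape)                # torch.Size([400])
```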

From a more technical perspective, this is achieved by utilizing an object detector together with a Vision Transformer (ViT) backbone network and a head network (ViGAT) [1]: the Vision Transformer is used for extracting vector representations of the detected objects as well as of the global scene, whereas the ViGAT head serves as the event recognizer, learning to focus on the important pieces of visual information that point to a specific event. This approach further contributes to explainable AI, being able to generate explanations in the form of the most salient objects and frames that led to the event recognition decision – in this way, providing valuable insight as to how the method works and, more importantly, why it fails when it fails.
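The graph-attention idea behind the ViGAT head can be illustrated with a simplified, single-block sketch: the detected objects of a frame are treated as nodes of a fully connected graph, attention weights are computed between them, and those weights double as per-object saliency scores for the explanation. This is only a rough simplification of the factorized, multi-block head described in [1], under assumed feature dimensions.

```python
# Simplified, assumption-based illustration of graph attention over object features:
# every object attends to every other object, and the averaged attention each object
# receives is reused as its saliency score. Not the actual ViGAT head of [1].
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphAttention(nn.Module):
    def __init__(self, feat_dim: int = 768):
        super().__init__()
        self.query = nn.Linear(feat_dim, feat_dim)
        self.key = nn.Linear(feat_dim, feat_dim)

    def forward(self, nodes):
        # nodes: (num_objects, feat_dim), e.g. one ViT feature vector per detected object
        q, k = self.query(nodes), self.key(nodes)
        attn = F.softmax(q @ k.t() / nodes.shape[-1] ** 0.5, dim=-1)  # (N, N) edge weights
        updated = attn @ nodes                       # message passing over the object graph
        saliency = attn.mean(dim=0)                  # how much attention each object receives
        return updated, saliency

objects = torch.randn(10, 768)                       # 10 object features from the ViT backbone
updated, saliency = SimpleGraphAttention()(objects)
print(saliency.argsort(descending=True)[:3])         # indices of the 3 most salient objects
```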

While our ViGAT method gives us state-of-the-art event recognition results, its in-depth processing of the video (i.e. the object recognition part) also introduces a considerable computational overhead. To improve the scalability of ViGAT, we further propose Gated-ViGAT [2], where the idea is to process for each video only the few frames that are most informative for making a confident event recognition decision, rather than processing a fixed and relatively high number of frames for every video. From a technical perspective, this is achieved by introducing a frame sampling policy that uses the frame-level ViGAT-generated explanations together with a gating mechanism, which performs early exiting as soon as enough frames have been processed to achieve the desired event recognition accuracy, in order to limit the amount of data (video frames) that needs to go through in-depth processing (i.e., object detection). In this way, Gated-ViGAT effectively reduces the computational footprint of our original ViGAT method, while maintaining high recognition accuracy and explainability of its results.
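The early-exiting behaviour can be sketched as follows: frames are ranked by the frame-level saliency produced by the explanation step, the most informative ones are processed first, and a gating score decides when enough evidence has been gathered to stop. The function names and threshold below are illustrative placeholders, not the actual Gated-ViGAT implementation [2].

```python
# Hypothetical sketch of the early-exiting idea: process frames in order of decreasing
# saliency and stop as soon as the gate is confident enough, so that the remaining
# frames never undergo the costly object detection step.
import torch

def gated_recognition(frame_saliency, process_frame, gate, classify,
                      step: int = 3, confidence_threshold: float = 0.8):
    """Process frames in order of decreasing saliency and exit early when confident."""
    order = torch.argsort(frame_saliency, descending=True)
    features = []
    for start in range(0, len(order), step):
        for idx in order[start:start + step]:
            features.append(process_frame(int(idx)))     # expensive per-frame processing
        stacked = torch.stack(features)
        if gate(stacked) >= confidence_threshold:         # gating mechanism: exit early
            break
    return classify(stacked)

# Toy usage with stand-in components.
saliency = torch.rand(30)                                 # saliency for 30 candidate frames
logits = gated_recognition(
    saliency,
    process_frame=lambda i: torch.randn(768),             # stands in for detection + ViT features
    gate=lambda feats: torch.rand(()).item(),             # stands in for the learned gate
    classify=lambda feats: feats.mean(dim=0),             # stands in for the ViGAT head
)
```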

In CRiTERIA, ViGAT and Gated-ViGAT are used for recognizing known events of interest. A frame from a video where the example event “Starting a campfire” was recognized is shown on the right.

Vasileios Mezaris

Vasileios Mezaris is a Research Director with the Information Technologies Institute (ITI) / Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece. He is the Head of the Intelligent Digital Transformation (IDT) Laboratory of ITI/CERTH, where he leads a group of researchers working on multimedia understanding and artificial intelligence. He holds a BSc and a PhD in Electrical and Computer Engineering, both from the Aristotle University of Thessaloniki.

References:

  1. N. Gkalelis, D. Daskalakis, V. Mezaris, "ViGAT: Bottom-up event recognition and explanation in video using factorized graph attention network", IEEE Access, vol. 10, pp. 108797-108816, 2022.
  2. N. Gkalelis, D. Daskalakis, V. Mezaris, "Gated-ViGAT: Efficient bottom-up event recognition and explanation using a new frame selection policy and gating mechanism", Proc. IEEE Int. Symposium on Multimedia (ISM), Naples, Italy, Dec. 2022.

All publications are available via the CRiTERIA publication portal.

Banner image by Robin Worrall on Unsplash.