Free-Text Video Search

Predicting Information Needs

Predicting the information needs of a user is not always possible. In some cases the information needs are known in advance, e.g. in the form of known events of interest that should be recognized in videos as they are processed by an automated content ingestion system. However, new information needs can arise at any time, and constantly training new classifiers such as elaborate event recognition models is not practical, for instance because of the need for annotated training data. For this reason, it is important to develop methods that allow searching within a video collection using free-text queries: they introduce the flexibility to address unpredictable information needs, complementing the elaborate recognition models that can be trained for predictable ones.

In the recent literature, the free-text video retrieval problem is typically addressed by training a cross-modal deep network that learns to encode, in parallel, the videos (or video parts) and the corresponding textual descriptions of a suitable training dataset into a joint latent feature space. The trained network can then be used to extract embeddings (i.e., vector representations in the joint latent feature space) of any incoming video or text snippet. Videos being ingested into a content management system are passed through the network once, and their embeddings are extracted and stored; whenever a new free-text query is issued, only the query needs to be passed through the same network, and its embedding is simply matched against the stored video embeddings in order to retrieve the most relevant videos.
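As an illustration of this two-step pipeline (a minimal sketch, not the actual CRiTERIA implementation), the following Python snippet assumes hypothetical encode_video and encode_text functions standing for the two branches of a trained cross-modal network, each returning a vector in the joint latent space; ingestion-time indexing and query-time matching then reduce to a few lines:

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1) -> np.ndarray:
    # Normalise vectors so that cosine similarity reduces to a dot product.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-12)

def index_videos(videos, encode_video) -> np.ndarray:
    """Pass each ingested video through the video branch of the trained
    network once, and store the resulting embeddings (ingestion time)."""
    return l2_normalize(np.stack([encode_video(v) for v in videos]))

def search(query_text: str, video_embeddings: np.ndarray, encode_text, top_k: int = 10):
    """Embed only the free-text query and rank the stored video embeddings
    by cosine similarity (query time)."""
    q = l2_normalize(encode_text(query_text))
    scores = video_embeddings @ q            # shape: (num_videos,)
    ranking = np.argsort(-scores)[:top_k]
    return ranking, scores[ranking]
```

Because the video embeddings are computed and stored only once, answering each new query costs just a single forward pass of the text branch plus a similarity computation against the stored embeddings.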

Free-Text Video Search in CRiTERIA

In CRiTERIA, we tackle the free-text video retrieval problem by investigating how to optimally combine multiple diverse textual and visual features into feature pairs, each of which gives rise to its own joint latent space (instead of a single such space) in which text and video are encoded into comparable representations. To learn these representations, our T×V [1] network architecture is trained following a multiple space learning procedure. This allows the T×V method to effectively exploit several complementary video and text features, instead of choosing just one such representation or concatenating heterogeneous representations into a single one, leading to a considerable improvement of the video retrieval results. We further introduce a dual softmax operation at the retrieval stage, which exploits prior text-video similarities to revise the similarities computed by the T×V network, leading to additional retrieval accuracy gains; a generic sketch of this kind of revision is given below.
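As a rough illustration of the dual softmax idea (the exact formulation used by T×V in [1], which exploits prior text-video similarities, may differ in its details), one common variant re-weights the raw query-video similarity matrix by softmaxes taken along both of its axes:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int, temperature: float = 100.0) -> np.ndarray:
    z = x * temperature
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def dual_softmax(sim: np.ndarray, temperature: float = 100.0) -> np.ndarray:
    """sim: (num_queries, num_videos) similarity matrix produced by the network.
    The raw similarities are re-weighted by a softmax over the videos for each
    query and a softmax over the queries for each video, which typically
    sharpens the correct text-video matches."""
    text_to_video = softmax(sim, axis=1, temperature=temperature)  # over videos
    video_to_text = softmax(sim, axis=0, temperature=temperature)  # over queries
    return sim * text_to_video * video_to_text
```

Since this operation is applied only at the retrieval stage, the trained network and the stored video embeddings remain unchanged; only the similarities used for ranking are revised.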

Free-text video search is used in CRiTERIA to retrieve video content for any unpredictable query of the user; e.g., in the context of a study of migrant smuggling practices, to retrieve all videos showing “A group of people walking in the woods”, as in the example shown below.

Vasileios Mezaris

Vasileios Mezaris is a Research Director with the Information Technologies Institute (ITI) / Centre for Research and Technology Hellas (CERTH), Thessaloniki, Greece. He is the Head of the Intelligent Digital Transformation (IDT) Laboratory of ITI/CERTH, where he leads a group of researchers working on multimedia understanding and artificial intelligence. He holds a BSc and a PhD in Electrical and Computer Engineering, both from the Aristotle University of Thessaloniki.

References:

  1. D. Galanopoulos and V. Mezaris, “Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval”, Proc. ECCV 2022 Workshops, Oct. 2022.

The publication is available via the CRiTERIA publication portal.

Banner image by Kaitlyn Baker on Unsplash.