VideoRAG for Law Enforcement: Exploring the VideoRAG Pipeline
Gluschkoff, Teresa (2026)
All rights reserved. This publication is copyrighted. You may download, display, and print it for your own personal use. Commercial use is prohibited.
The permanent address of this publication is
https://urn.fi/URN:NBN:fi:amk-202601191482
Abstract
Growing volumes of digital video make analysis demanding for law enforcement, as manual viewing is time-consuming. Traditional metadata-based search is insufficient, and existing artificial intelligence (AI) video analytics remain limited in supporting query-based, context-aware retrieval. Development of a Video Retrieval-Augmented Generation (VideoRAG) system was therefore commissioned by the Luxembourg Police, with Europol's Innovation Lab providing AI expertise. The first objective was to develop a VideoRAG system that retrieves video segments relevant to natural-language queries and generates text answers from their visual and audio content. The second objective was to evaluate the effectiveness of alternative pipeline configurations. A Design Science Research (DSR) approach was applied.
Two VideoRAG pipelines were implemented using pre-trained open-source models. Pipeline A used VideoPrism for text-to-video retrieval, OpenCLIP to select relevant video segments, and Gemma-3n-E2B-it to generate captions from images and audio. Pipeline B embedded frames and transcripts with OpenCLIP, indexed them with FAISS for retrieval, and used LLaVA-NeXT-Video-7B to caption retrieved frames and Mistral-7B to produce scene-level explanations. Whisper was used in both pipelines to transcribe audio.
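As an illustration of the Pipeline B retrieval stage, the sketch below embeds sampled frames with OpenCLIP, indexes them in FAISS, and matches a natural-language query by cosine similarity. The model checkpoint, frame paths, and query are illustrative assumptions, not the exact configuration used in the thesis.

```python
# Minimal sketch of a CLIP-based frame retrieval stage, assuming frames
# have already been extracted from the videos (paths are hypothetical).
import faiss
import numpy as np
import open_clip
import torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

frame_paths = ["frame_000.jpg", "frame_001.jpg"]  # hypothetical extracted frames

with torch.no_grad():
    # Embed every frame into the shared CLIP image/text space.
    images = torch.stack([preprocess(Image.open(p)) for p in frame_paths])
    frame_emb = model.encode_image(images).float().numpy()

# Cosine similarity via inner product on L2-normalised vectors.
faiss.normalize_L2(frame_emb)
index = faiss.IndexFlatIP(frame_emb.shape[1])
index.add(frame_emb)

with torch.no_grad():
    # Embed the natural-language query into the same space.
    query = tokenizer(["a person entering a building at night"])
    query_emb = model.encode_text(query).float().numpy()
faiss.normalize_L2(query_emb)

# Top-k frames most similar to the query; their scenes would then be
# handed to the captioning and explanation models.
scores, ids = index.search(query_emb, 2)
for rank, (i, s) in enumerate(zip(ids[0], scores[0]), start=1):
    print(f"{rank}. {frame_paths[i]} (similarity {s:.3f})")
```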
Evaluation was carried out using subsets of the FineVideo and PLM-VideoBench datasets. Retrieval performance showed high recall, with Pipeline A achieving better video-level retrieval: in 96% of queries a relevant video was retrieved within the top 10 results, and in nearly 70% of queries at rank 1, while scene-level retrieval performance was comparable for both pipelines. For generation, the metrics indicated good performance: Pipeline A scored higher on ROUGE-L and BERTScore, while Pipeline B scored higher on BLEU-1 and METEOR. Parameter analysis showed that using more scenes per video and more frames per scene improved retrieval performance.
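The video-level retrieval figures above correspond to Recall@k: the fraction of queries for which a relevant video appears within the top k results. The sketch below shows this computation on invented sample data; the ranked lists and relevance labels are illustrative assumptions.

```python
# Hedged sketch of Recall@k: for each query we have a ranked list of
# retrieved video IDs and one relevant video ID (sample data is invented).
def recall_at_k(ranked_lists, relevant, k):
    """Fraction of queries whose relevant video appears in the top k."""
    hits = sum(1 for ranks, rel in zip(ranked_lists, relevant) if rel in ranks[:k])
    return hits / len(relevant)

ranked_lists = [
    ["v3", "v1", "v7"],  # query 1: relevant video at rank 1
    ["v5", "v2", "v9"],  # query 2: relevant video at rank 2
    ["v4", "v6", "v8"],  # query 3: relevant video not retrieved
]
relevant = ["v3", "v2", "v1"]

for k in (1, 5, 10):
    print(f"Recall@{k}: {recall_at_k(ranked_lists, relevant, k):.2f}")
```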
VideoRAG was found to be feasible and could reduce reliance on manual video viewing. Pipeline A provided a more effective end-to-end configuration, while Pipeline B produced more detailed scene-level explanations. Future development could include integrating the stages into a single automated pipeline, combining the retrieval of Pipeline A with the generation of Pipeline B, extending the system to larger and more complex video collections, and evaluating its practical usefulness.
