Latest Advances in Video and Multimodal Retrieval Papers: August 2025


Hey everyone! Check out the latest papers I've rounded up for you. This update focuses on Video Retrieval and Multimodal Retrieval, so if you're into those topics, you're in the right place! For an even better experience and access to more papers, definitely check out the GitHub page.

Video Retrieval

| Title | Date | Comment |
| --- | --- | --- |
| Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval | 2025-08-06 | ICCV 2025 Highlight |
| GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval | 2025-08-03 | |
| Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos | 2025-07-29 | 13 pages; accepted at ACM MM 2025 |
| T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval | 2025-07-28 | |
| HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning | 2025-07-27 | Accepted by ICCV'25; 13 pages, 6 figures, 4 tables |
| Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges | 2025-07-25 | |
| FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models | 2025-07-24 | ICCV 2025 |
| Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization | 2025-07-24 | Accepted by ICCV 2025 |
| Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval | 2025-07-21 | |
| U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs | 2025-07-20 | Technical report (in progress) |
| Smart Routing for Multimodal Video Retrieval: When to Search What | 2025-07-12 | Accepted to ICCV 2025 Multimodal Representation and Retrieval Workshop |
| MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian | 2025-07-12 | 10 pages, 5 figures, 5 tables |
| BiMa: Towards Biases Mitigation for Text-Video Retrieval via Scene Element Guidance | 2025-07-07 | Accepted at ACM MM 2025 |
| VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents | 2025-07-07 | Technical report |
| Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos | 2025-07-03 | 7 pages, 10 figures |

Exploring the Cutting Edge of Video Retrieval: August 11, 2025

In the dynamic field of video retrieval, recent advancements are pushing the boundaries of what's possible. This compilation of papers highlights some of the most exciting developments, offering insights into innovative techniques and approaches. Let's dive into the key themes and contributions shaping the future of video retrieval.

One prominent area of focus is the use of multimodal large language models (MLLMs) to enhance text-video retrieval. The paper "Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval" (ICCV 2025 Highlight) demonstrates the power of MLLMs in understanding and bridging the gap between textual and visual information. This approach allows for more accurate and context-aware retrieval, paving the way for improved search experiences. Understanding how to leverage different modalities—text, audio, and video—is crucial in this domain. Techniques like frame-level gated audio-visual integration, as seen in "GAID: Frame-Level Gated Audio-Visual Integration with Directional Perturbation for Text-Video Retrieval," aim to more effectively combine these modalities for better retrieval performance. These multimodal approaches recognize that videos are not just visual content but also rich sources of audio and semantic information.
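To make the frame-level gating idea concrete, here is a minimal sketch of gated audio-visual fusion. The dimensions, the random gate weights, and the scalar per-frame gate are all illustrative assumptions for this post, not GAID's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy per-frame embeddings: 8 frames, 16-dim visual and audio features.
T, D = 8, 16
visual = rng.standard_normal((T, D))
audio = rng.standard_normal((T, D))

# A learned projection would produce the gate; random weights stand in here.
W = rng.standard_normal(2 * D)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Frame-level gate: one scalar per frame deciding how much audio to admit.
gate = sigmoid(np.concatenate([visual, audio], axis=1) @ W)  # shape (T,)

# Gated fusion: interpolate between the visual and audio streams per frame.
fused = gate[:, None] * visual + (1.0 - gate[:, None]) * audio
print(fused.shape)  # (8, 16)
```

The point of gating at the frame level, rather than globally, is that audio usefulness varies within a video: a frame with speech can lean on audio while a silent frame stays mostly visual.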

Addressing biases and challenges is another critical aspect of video retrieval research. The paper "Generative Ghost: Investigating Ranking Bias Hidden in AI-Generated Videos" delves into the complexities of ranking bias in AI-generated content, a growing concern as synthetic videos become more prevalent. Similarly, "Short-Form Video Recommendations with Multimodal Embeddings: Addressing Cold-Start and Bias Challenges" tackles the cold-start problem and bias issues in recommending short-form videos, which are increasingly popular on platforms like TikTok and Instagram. Mitigating these biases is essential for ensuring fair and relevant retrieval results.

Researchers are also exploring more efficient ways to process video data. "FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Vision Language Models" introduces a method for reducing video tokens, making it more feasible to apply large vision language models to video retrieval. This efficiency is crucial for real-world applications where computational resources are often limited. Similarly, "Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval" proposes a sampling technique that optimizes the selection of frames for retrieval, balancing accuracy and efficiency.
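The general similarity-plus-importance idea behind token reduction can be sketched in a few lines. The thresholds, the linear importance scores, and the drop rule below are illustrative assumptions, not FrameFusion's actual criteria:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate temporal redundancy: 4 distinct frame tokens, each repeated three
# times with small noise, since adjacent video frames are often near-duplicates.
base = rng.standard_normal((4, 8))
tokens = np.repeat(base, 3, axis=0) + 0.01 * rng.standard_normal((12, 8))
importance = np.linspace(0.0, 1.0, 12)  # stand-in for per-token attention mass

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Drop a token only when it is both redundant (very similar to the last kept
# token) and unimportant; both thresholds are chosen for illustration.
keep = [0]
for i in range(1, len(tokens)):
    redundant = cosine(tokens[i], tokens[keep[-1]]) > 0.9
    unimportant = importance[i] < 0.2
    if not (redundant and unimportant):
        keep.append(i)

reduced = tokens[keep]
print(f"kept {len(keep)} of {len(tokens)} tokens")
```

Combining the two signals matters: pruning on similarity alone can delete a repeated-but-salient token, while pruning on importance alone keeps many near-duplicates.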

Interactive and adaptive retrieval methods are also gaining traction. "Quantifying and Narrowing the Unknown: Interactive Text-to-Video Retrieval via Uncertainty Minimization" presents an interactive approach where users can refine their queries based on uncertainty minimization, leading to more precise results. "T2VParser: Adaptive Decomposition Tokens for Partial Alignment in Text to Video Retrieval" focuses on partial alignment between text and video, allowing for more flexible and robust retrieval. These adaptive techniques acknowledge that user intent and query formulation can evolve during the retrieval process.
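One generic way to decide when interaction is worth the user's time is to measure how ambiguous the current ranking is, for example via the entropy of the score distribution. The entropy trigger and threshold below are a sketch of uncertainty-driven interaction in general, not the specific method of the paper:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy(p):
    return float(-(p * np.log(p + 1e-12)).sum())

# Toy retrieval scores for 5 candidate videos under two different queries.
confident = softmax(np.array([9.0, 1.0, 0.5, 0.2, 0.1]))   # clear winner
uncertain = softmax(np.array([2.1, 2.0, 1.9, 2.05, 1.95])) # near-tie

# Ask the user a clarifying question only when the ranking is ambiguous.
for name, p in [("confident", confident), ("uncertain", uncertain)]:
    ask = entropy(p) > 1.0  # threshold chosen for illustration
    print(name, round(entropy(p), 2), "ask user:", ask)
```

When one candidate dominates, entropy is near zero and the system can answer directly; a near-uniform distribution is close to the maximum entropy of ln(5), so a round of interaction is likely to pay off.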

Finally, the development of new benchmarks and datasets is crucial for advancing the field. "MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian" introduces a new benchmark for evaluating multimodal video-text tasks in Indonesian, highlighting the importance of multilingual and multicultural resources. "Are Synthetic Videos Useful? A Benchmark for Retrieval-Centric Evaluation of Synthetic Videos" explores the utility of synthetic videos by providing a retrieval-centric evaluation framework. These benchmarks help researchers objectively compare different approaches and identify areas for improvement. In conclusion, the recent papers in video retrieval showcase a vibrant and evolving field, with research spanning multimodal integration, bias mitigation, efficiency improvements, interactive methods, and benchmark development. These advancements collectively contribute to making video retrieval more accurate, efficient, and user-friendly.
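Benchmarks like these typically report Recall@K, the fraction of queries whose ground-truth video appears in the top K results. A minimal sketch of the metric, assuming query i's correct match is video i:

```python
import numpy as np

def recall_at_k(scores, k):
    """scores[i, j] = similarity of text query i to video j."""
    ranks = np.argsort(-scores, axis=1)                 # best-first per query
    hits = [i in ranks[i, :k] for i in range(len(scores))]
    return sum(hits) / len(hits)

scores = np.array([
    [0.9, 0.1, 0.2],   # query 0: correct video (0) ranked first
    [0.3, 0.2, 0.8],   # query 1: correct video (1) ranked last
    [0.1, 0.7, 0.5],   # query 2: correct video (2) ranked second
])
print(recall_at_k(scores, 1))   # 1/3 of queries hit at rank 1
print(recall_at_k(scores, 2))   # 2/3 of queries hit within the top 2
```

Papers usually report R@1, R@5, and R@10 together, since a method can trade top-1 precision against broader top-K coverage.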

Multimodal Retrieval

| Title | Date | Comment |
| --- | --- | --- |
| M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation | 2025-08-08 | |
| Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions | 2025-08-07 | Preprint |
| mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering | 2025-08-07 | |
| UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval | 2025-08-06 | |
| Understanding protein function with a multimodal retrieval-augmented foundation model | 2025-08-05 | |
| ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval | 2025-07-29 | |
| PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning | 2025-07-28 | Accepted to ACM MM 2025 |
| VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings | 2025-07-22 | Accepted at RecSys 2025; DOI: https://doi.org/10.1145/3705328.3748064 |
| U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs | 2025-07-20 | Technical report (in progress) |
| Evaluating Multimodal Large Language Models on Educational Textbook Question Answering | 2025-07-15 | 8 pages |
| DeepWriter: A Fact-Grounded Multimodal Writing Assistant Based On Offline Knowledge Base | 2025-07-14 | Work in progress |
| Smart Routing for Multimodal Video Retrieval: When to Search What | 2025-07-12 | Accepted to ICCV 2025 Multimodal Representation and Retrieval Workshop |
| Llama Nemoretriever Colembed: Top-Performing Text-Image Retrieval Model | 2025-07-07 | |
| MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering | 2025-06-28 | |
| Universal Retrieval for Multimodal Trajectory Modeling | 2025-06-27 | 18 pages, 3 figures; accepted by Workshop on Computer-use Agents @ ICML 2025 |

Latest Advances in Multimodal Retrieval: August 11, 2025

Multimodal retrieval, an increasingly important area in AI, focuses on retrieving information using multiple types of data, such as text, images, and audio. Recent research has explored a variety of techniques to enhance multimodal retrieval, addressing challenges like data fusion, cross-modal understanding, and efficient processing. This section provides an overview of the latest papers in this field.

One of the prominent trends in multimodal retrieval is the integration of large language models (LLMs) and knowledge graphs. "mKG-RAG: Multimodal Knowledge Graph-Enhanced RAG for Visual Question Answering" presents an approach that leverages knowledge graphs to improve visual question answering, demonstrating how structured knowledge can enhance retrieval accuracy. Similarly, "VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings" explores the use of LLMs to augment CLIP embeddings for improved multimodal recommendations, showcasing the versatility of LLMs in enhancing retrieval systems. These approaches highlight the importance of leveraging semantic understanding to bridge the gap between different modalities.
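The core loop of knowledge-graph-enhanced RAG is easy to sketch: retrieve triples relevant to the question and prepend them as facts to the generator's prompt. The toy triple store, string-matching retriever, and prompt format below are all illustrative assumptions, not mKG-RAG's actual pipeline:

```python
# A hypothetical in-memory triple store; a real system would query a KG.
TRIPLES = [
    ("Eiffel Tower", "located_in", "Paris"),
    ("Eiffel Tower", "height_m", "330"),
    ("Louvre", "located_in", "Paris"),
]

def retrieve(question):
    """Naive retrieval: keep triples whose subject appears in the question."""
    q = question.lower()
    return [t for t in TRIPLES if t[0].lower() in q]

def build_prompt(question):
    """Ground the generator by prepending retrieved facts to the question."""
    facts = "\n".join(f"{s} {p} {o}" for s, p, o in retrieve(question))
    return f"Facts:\n{facts}\n\nQuestion: {question}"

print(build_prompt("How tall is the Eiffel Tower in the photo?"))
```

The value of the structured store over free-text retrieval is that each fact arrives as a clean subject-predicate-object unit, which is easier for the answering model to cite and harder for it to hallucinate around.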

Another key area of focus is developing efficient frameworks for multimodal retrieval. "PUMA: Layer-Pruned Language Model for Efficient Unified Multimodal Retrieval with Modality-Adaptive Learning" introduces a layer-pruned language model that achieves efficient multimodal retrieval through modality-adaptive learning. Efficiency is a critical factor in real-world applications, where large-scale datasets and real-time processing are often required. The use of reinforcement learning (RL) is also being explored to optimize retrieval processes. "M2IO-R1: An Efficient RL-Enhanced Reasoning Framework for Multimodal Retrieval Augmented Multimodal Generation" proposes an RL-enhanced framework for multimodal generation, demonstrating how RL can be used to improve the reasoning capabilities of retrieval systems. This approach emphasizes the importance of adaptive and intelligent retrieval strategies.

Multimodal retrieval is also making significant strides in specific application domains. "Understanding protein function with a multimodal retrieval-augmented foundation model" showcases the use of multimodal retrieval in understanding protein function, highlighting the applicability of these techniques in scientific research. "ArtSeek: Deep artwork understanding via multimodal in-context reasoning and late interaction retrieval" focuses on deep artwork understanding through multimodal reasoning, demonstrating the potential of multimodal retrieval in the arts and humanities. These applications underscore the versatility of multimodal retrieval across diverse domains.

Addressing the challenges of data heterogeneity and cross-modal alignment is crucial in multimodal retrieval. "UniFGVC: Universal Training-Free Few-Shot Fine-Grained Vision Classification via Attribute-Aware Multimodal Retrieval" introduces a training-free approach for few-shot vision classification, showcasing the ability to adapt to new classes with limited data. "Explaining Similarity in Vision-Language Encoders with Weighted Banzhaf Interactions" delves into the interpretability of vision-language encoders, providing insights into how models perceive similarity across modalities. Understanding these interactions is vital for building robust and reliable retrieval systems. Techniques like optimal transport are also being used to address the challenges of multimodal fusion. "MOTOR: Multimodal Optimal Transport via Grounded Retrieval in Medical Visual Question Answering" proposes a method based on optimal transport for medical visual question answering, demonstrating the effectiveness of this approach in aligning different modalities.
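For readers unfamiliar with optimal transport, the standard entropy-regularised Sinkhorn iteration gives a feel for how it aligns two modalities: it finds a soft matching between, say, image regions and text tokens that respects both sides' mass. This is a generic OT sketch with toy data, not MOTOR's grounded-retrieval formulation:

```python
import numpy as np

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropy-regularised OT plan between uniform marginals."""
    n, m = cost.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-cost / eps)                          # Gibbs kernel
    u = np.ones(n)
    for _ in range(iters):                           # alternating scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]               # transport plan

rng = np.random.default_rng(2)
img = rng.standard_normal((4, 6))   # 4 toy image-region embeddings
txt = rng.standard_normal((5, 6))   # 5 toy text-token embeddings

# Cost = cosine distance between every region/token pair.
cost = 1.0 - img @ txt.T / (
    np.linalg.norm(img, axis=1)[:, None] * np.linalg.norm(txt, axis=1)[None, :]
)
plan = sinkhorn(cost)
print(plan.shape, round(plan.sum(), 6))  # a 4x5 plan whose mass sums to 1
```

High entries in the plan indicate region-token pairs the alignment considers matched, which is why OT is attractive for cross-modal fusion: it produces a global, mass-conserving correspondence rather than independent per-pair similarities.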

Finally, the development of universal retrieval models is an ongoing effort. "Universal Retrieval for Multimodal Trajectory Modeling" presents a universal retrieval approach for trajectory modeling, highlighting the potential for creating systems that can handle a wide range of retrieval tasks. "U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs" identifies key factors for universal multimodal retrieval, emphasizing the importance of embedding learning with MLLMs. These efforts aim to build more generalizable and adaptable retrieval systems. In summary, the latest research in multimodal retrieval spans a wide range of techniques and applications, from integrating LLMs and knowledge graphs to developing efficient frameworks and addressing data heterogeneity. These advancements are driving the field forward, making multimodal retrieval an increasingly powerful tool for information access and understanding.
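At inference time, most of these universal systems reduce to nearest-neighbour search in one shared embedding space, whatever the query's modality. A minimal sketch, using random vectors as stand-ins for real encoders:

```python
import numpy as np

rng = np.random.default_rng(3)

# 10 corpus items (imagine a mix of text, image, and video embeddings that
# real encoders have already mapped into one shared 32-dim space).
D = 32
corpus = rng.standard_normal((10, D))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)

# A query from any modality lands in the same space; here we fake one that
# should match item 3 by perturbing item 3's embedding slightly.
query = corpus[3] + 0.05 * rng.standard_normal(D)
query /= np.linalg.norm(query)

scores = corpus @ query            # cosine similarity on unit vectors
ranking = np.argsort(-scores)      # best-first
print("top hit:", ranking[0])
```

The appeal of the universal formulation is exactly this uniformity: once everything lives in one space, a single index and a single distance function serve text-to-video, image-to-text, and any other retrieval direction.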