To provide a more thorough evaluation of the capabilities of long video retrieval systems, we propose a pipeline that leverages state-of-the-art large language models to carefully generate a diverse set of synthetic captions for long videos.
The critique NLF identifies the strengths and weaknesses of the responses and is used to align the LVLMs with human preferences.
Paraphrasing of offensive content is a better alternative to content removal and helps improve civility in a communication environment.
We present SayNav, a new approach that leverages human knowledge from Large Language Models (LLMs) for efficient generalization to complex navigation tasks in unknown large-scale environments.
Based on this pipeline and the existing coarse-grained annotated dataset, we build the CURE benchmark to measure both the zero-shot reasoning performance and consistency of VLMs.
We present TIJO (Trigger Inversion using Joint Optimization), a multimodal backdoor defense technique.
Content moderation is the process of flagging content based on pre-defined platform rules.
This is challenging for the attacker as the detector can distort or ignore the visual trigger entirely, which leads to models where backdoors are over-reliant on the language trigger.
We also observe a drop in performance across all the models when testing on RecipeQA and the proposed Meta-RecipeQA (e.g., 83.6% versus 67.1% for HTRN), which shows that the proposed dataset is relatively less biased.
We then evaluate M3C using a textual cloze style question-answering task and highlight an inherent bias in the question answer generation method from  that enables a naive baseline to cheat by learning from only answer choices.
Recent studies have shown that neural networks are vulnerable to Trojan attacks, where a network is trained to respond to specially crafted trigger patterns in the inputs in specific and potentially malicious ways.
We improve zero-shot learning (ZSL) by incorporating common-sense knowledge in DNNs.
To enable large-scale evaluation, we introduce a new dataset containing over 550K pairs of RGB and aerial LIDAR depth images, covering a 143 km^2 area.
We introduce Deep Adaptive Semantic Logic (DASL), a novel framework for automating the generation of deep neural networks that incorporates user-provided formal knowledge to improve learning from data.
For instance, if a model answers "red" to "What color is the balloon?
Food classification is a challenging problem due to the large number of categories, high visual similarity between different foods, as well as the lack of datasets for training state-of-the-art deep models.
We also show that the user embeddings learned within our joint multimodal embedding model are better at predicting user interests compared to those learned with unimodal content on Instagram data.
Computing author intent from multimodal data like Instagram posts requires modeling a complex relationship between text and image.
We propose a novel end-to-end model that uses caption-to-image retrieval as a `downstream' task to guide the process of phrase localization.
Furthermore, we present an extensive study demonstrating the contribution of each component of our model, showing improvements of $8$--$15\%$ and $4\%$ from adding semantic information and our proposed attention module, respectively.
We show that our model outperforms other baselines on the benchmark Ad dataset and also show qualitative results to highlight the advantages of using multihop co-attention.
We introduce and tackle the problem of zero-shot object detection (ZSD), which aims to detect object classes that are not observed during training.
We propose a novel method for temporally pooling frames in a video for the task of human action recognition.
The results of experiments suggest that the proposed model equipped with Dirichlet state encoding is superior in performance, and selects images that lead to better training and higher accuracy of label prediction at test time.
In recent years, Printed Circuit Boards (PCBs) have become the backbone of a large number of consumer electronic devices, leading to a surge in their production.