To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout.
Video moment retrieval (VMR) identifies a specific moment in an untrimmed video for a given natural language query.
Ranked #2 on Moment Retrieval on Charades-STA
To this end, we propose Bayesian red teaming (BRT), a novel query-efficient black-box red teaming method based on Bayesian optimization, which iteratively identifies diverse positive test cases leading to model failures by utilizing a pre-defined user input pool and past evaluations.
This task is difficult due to the geometric distortion of panoramic images and the lack of panoramic image datasets with diverse conditions, such as weather or time of day.
Token-based masked generative models are gaining popularity for their fast inference time with parallel decoding.
Text-to-3D generation has recently shown rapid progress with the advent of score distillation, a methodology that uses pretrained text-to-2D diffusion models to optimize a neural radiance field (NeRF) in the zero-shot setting.
We also find that view-specific special tokens can distinguish between different views and properly generate specific views even when they do not exist in the dataset, and that utilizing multi-view chest X-rays faithfully captures abnormal findings in the additional X-rays.
Multi-resolution hash encoding has recently been proposed to reduce the computational cost of neural rendering methods such as NeRF.
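To make the idea concrete, here is a minimal, hypothetical sketch of multi-resolution hash encoding for a 2D point: at each level the point is snapped to a grid whose resolution doubles, the grid corner is hashed into a fixed-size feature table, and the looked-up features are concatenated. The table sizes, primes, and omission of corner interpolation are simplifications, not the exact scheme of any particular paper.

```python
import numpy as np

def hash_encode(x, num_levels=4, base_res=16, table_size=2**14, feat_dim=2):
    """Sketch of multi-resolution hash encoding for a 2D point x in [0, 1)^2.

    Real implementations (e.g. Instant-NGP) also interpolate between the
    surrounding grid corners and learn the tables; here the tables are
    just randomly initialised for illustration.
    """
    rng = np.random.default_rng(0)
    # One feature table per level (trainable parameters in practice).
    tables = [rng.normal(size=(table_size, feat_dim)) for _ in range(num_levels)]
    primes = np.array([1, 2654435761])  # spatial-hash primes, Instant-NGP style
    feats = []
    for level, table in enumerate(tables):
        res = base_res * (2 ** level)                      # resolution doubles per level
        corner = np.floor(np.asarray(x) * res).astype(np.int64)
        idx = np.bitwise_xor.reduce(corner * primes) % table_size
        feats.append(table[idx])
    return np.concatenate(feats)  # shape: (num_levels * feat_dim,)

print(hash_encode([0.3, 0.7]).shape)  # (8,)
```

The encoding is cheap because each level is a single table lookup, while the growing resolutions let nearby points share coarse features but differ in fine ones.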
Efficient video-language modeling should consider the computational cost because of a large, sometimes intractable, number of video frames.
Ranked #6 on Video Question Answering on NExT-QA
Experiments on standard benchmarks demonstrate the effectiveness of the method, in particular when label noise complicates the identification of bias-conflicting examples.
Video corpus moment retrieval (VCMR) is the task of retrieving the most relevant video moment from a large video corpus using a natural language query.
Ranked #2 on Video Corpus Moment Retrieval on TVR
To combine parameter-efficient adaptation and model compression, we propose AlphaTuning consisting of post-training quantization of the pre-trained language model and fine-tuning only some parts of quantized parameters for a target task.
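The division of labor can be illustrated with a toy 1-bit binary-coding quantization: weights are factored into frozen sign matrices and small per-row scaling factors, and only the scales would be fine-tuned for the target task. This is a hypothetical simplification (AlphaTuning uses multi-bit binary coding), meant only to show which parameters stay trainable.

```python
import numpy as np

def binarize(W):
    """Toy 1-bit binary-coding quantization: W ~= alpha * B with B in {-1, +1}.

    alpha is the per-row mean absolute value; B carries the signs and is frozen
    after quantization, while alpha remains a small set of trainable scales.
    """
    B = np.sign(W)
    B[B == 0] = 1.0
    alpha = np.mean(np.abs(W), axis=1, keepdims=True)  # one scale per output row
    return alpha, B

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))      # a pre-trained weight matrix (stand-in)
alpha, B = binarize(W)
W_q = alpha * B                  # quantized reconstruction
# During adaptation, only `alpha` (a tiny fraction of the parameters) would be
# fine-tuned for the target task; B stays frozen and heavily compressed.
err = np.linalg.norm(W - W_q) / np.linalg.norm(W)
```

Since `alpha` has one value per row versus a full row of weights, the trainable parameter count drops by roughly the row width, which is what makes the adaptation parameter-efficient.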
As a result, GST scales the amount of training data to an order of magnitude larger than VisDial (from 1.2M to 12.9M QA data).
Building on the recent trend of multimodal generative evaluation exploiting a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using CLIP features as a unified metric, coined Mutual Information Divergence (MID).
Ranked #1 on Human Judgment Classification on Pascal-50S
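The Gaussian mutual-information estimator at the core of such a metric can be sketched as follows. Under a joint Gaussian assumption, I(X; Y) = 0.5 * log(det(Cx) * det(Cy) / det(C_joint)); in MID, X and Y would be CLIP image and text embeddings. This sketch illustrates only the Gaussian MI estimator, not the paper's exact formulation.

```python
import numpy as np

def gaussian_mutual_information(X, Y, eps=1e-6):
    """Mutual information between feature sets X and Y under a joint Gaussian
    assumption, estimated from sample covariances (regularised by eps)."""
    Z = np.concatenate([X, Y], axis=1)
    dx = X.shape[1]
    C = np.cov(Z, rowvar=False) + eps * np.eye(Z.shape[1])
    _, logdet_joint = np.linalg.slogdet(C)
    _, logdet_x = np.linalg.slogdet(C[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(C[dx:, dx:])
    return 0.5 * (logdet_x + logdet_y - logdet_joint)

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 8))                 # stand-in for image embeddings
Y = X + 0.1 * rng.normal(size=(512, 8))       # strongly dependent "text" features
Y_ind = rng.normal(size=(512, 8))             # independent features
print(gaussian_mutual_information(X, Y) > gaussian_mutual_information(X, Y_ind))  # True
```

Higher estimated MI means the two modalities' features co-vary strongly, which is why the (negated) quantity can serve as a divergence-style quality score.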
Cross-domain few-shot learning (CD-FSL), where there are few target samples under extreme differences between source and target domains, has recently attracted considerable attention.
This data enables self-supervised pre-training on the target domain, in addition to supervised pre-training on the source domain.
We demonstrate the efficiency of semi-orthogonal embedding for unsupervised anomaly segmentation.
Ranked #1 on Unsupervised Anomaly Detection on KolektorSDD (using extra training data)
When a person identifies objects, they reason by associating the objects with many classes and draw conclusions by taking inter-class relations into account.
To validate our method, we conduct experiments on meta-transfer learning and few-shot learning tasks under multiple settings.
The task is divided into two stages: 1) the classification of each message, and 2) the classification of the entire conversation.
Visual dialog is a task of answering a sequence of questions grounded in an image using the previous dialog history as context.
Ablation studies confirm that the dual attention mechanism combined with late fusion achieves the best performance.
In this paper, we propose bilinear attention networks (BAN) that find bilinear attention distributions to utilize given vision-language information seamlessly.
Ranked #11 on Phrase Grounding on Flickr30k Entities Test
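The core bilinear interaction can be sketched in a few lines: every image region is scored against every question word through a bilinear form, and a softmax over all region-word pairs yields a joint attention distribution. This single-glimpse sketch omits BAN's low-rank factorisation and multiple glimpses; the shapes and weight `W` are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def bilinear_attention_map(X, Y, W):
    """Single-glimpse bilinear attention sketch.

    X: visual features (num_regions, d_v)
    Y: question features (num_words, d_q)
    W: (d_v, d_q) bilinear weight (low-rank factorised in BAN proper)
    Returns a joint distribution over (region, word) pairs.
    """
    logits = X @ W @ Y.T                                   # pairwise interaction scores
    return softmax(logits.ravel()).reshape(logits.shape)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))   # 5 image regions
Y = rng.normal(size=(3, 12))   # 3 question words
W = rng.normal(size=(16, 12))
A = bilinear_attention_map(X, Y, W)
print(A.shape, np.isclose(A.sum(), 1.0))  # (5, 3) True
```

Because the attention is a distribution over region-word pairs rather than over one modality at a time, both inputs are attended jointly, which is the sense in which the information is used "seamlessly".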
Kim et al. (2016) show that the Hadamard product in multimodal deep networks, which is widely used as a joint function for visual question answering tasks, implicitly performs an attentional mechanism over visual inputs.
The game involves two players: a Teller and a Drawer.
Catastrophic forgetting is a problem in which a neural network loses information about a first task after being trained on a second task.
We present Multimodal Residual Networks (MRN) for multimodal residual learning of visual question answering, extending the idea of deep residual learning.
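One multimodal residual block can be sketched as follows: the question representation is updated by a residual term that joins it with the visual feature via a Hadamard product, and blocks can be stacked as in deep residual learning. The specific weights, nonlinearities, and shapes here are illustrative assumptions, not MRN's exact architecture.

```python
import numpy as np

def mrn_block(q, v, Wq, Wv, Wo):
    """Sketch of one multimodal residual block in the spirit of MRN.

    q: question representation, v: visual feature. The residual branch joins
    the two modalities with a Hadamard product; the shortcut carries q through.
    """
    joint = np.tanh(Wq @ q) * np.tanh(Wv @ v)   # multimodal joint via Hadamard product
    return q + Wo @ joint                        # residual (shortcut) connection

rng = np.random.default_rng(0)
d = 16
q, v = rng.normal(size=d), rng.normal(size=d)
Wq, Wv, Wo = (rng.normal(size=(d, d)) for _ in range(3))
h1 = mrn_block(q, v, Wq, Wv, Wo)
h2 = mrn_block(h1, v, Wq, Wv, Wo)   # blocks stack, as in deep residual learning
print(h2.shape)  # (16,)
```

The shortcut keeps the question signal flowing unchanged through the stack, so each block only has to learn the multimodal correction, mirroring the role of identity shortcuts in deep residual networks.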