To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout.
Previous works on CZSL often struggle to capture the contextuality between attribute and object, to learn discriminative visual features, and to handle the long-tailed distribution of real-world compositional data.
In this paper, we focus on the models' visual perception alignment with humans, further referred to as AI-human visual alignment.
This task is difficult due to the geometric distortion of panoramic images and the lack of panoramic image datasets with diverse conditions, such as weather or time of day.
Recovering 3D human mesh in the wild is greatly challenging as in-the-wild (ITW) datasets provide only 2D pose ground truths (GTs).
Ranked #5 on 3D Multi-Person Pose Estimation on MuPoTS-3D
In this paper, we efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters.
Ranked #1 on Action Classification on Diving-48
Text-to-3D generation has shown rapid progress recently with the advent of score distillation, a methodology that uses pretrained text-to-2D diffusion models to optimize a neural radiance field (NeRF) in the zero-shot setting.
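As background for the score-distillation entry above, the score distillation sampling (SDS) gradient popularized by DreamFusion is commonly written as:

\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\!\left[ w(t)\,\big(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\big)\,\frac{\partial x}{\partial \theta} \right]

where x = g(\theta) is an image rendered from the NeRF with parameters \theta, x_t is its noised version at timestep t, \hat{\epsilon}_\phi is the pretrained 2D diffusion model's noise prediction conditioned on the text prompt y, and w(t) is a timestep weighting.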
Multi-resolution hash encoding has recently been proposed to reduce the computational cost of neural renderings, such as NeRF.
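The multi-resolution hash encoding mentioned above can be sketched in a few lines of numpy. This is a minimal illustration of the Instant-NGP-style lookup (per-level spatial hashing plus trilinear interpolation over grid corners), with illustrative table sizes and growth factor, not the optimized implementation:

```python
import numpy as np

def hash_coords(coords, table_size):
    # Spatial hash in the Instant-NGP style: XOR of coordinates times large primes.
    primes = np.array([1, 2654435761, 805459861], dtype=np.uint64)
    h = np.zeros(coords.shape[:-1], dtype=np.uint64)
    for d in range(coords.shape[-1]):
        h ^= coords[..., d].astype(np.uint64) * primes[d]
    return h % table_size

def hash_encode(x, tables, base_res=16, growth=1.5):
    """Encode 3D points x in [0,1)^3 with one hashed feature grid per level."""
    feats = []
    for level, table in enumerate(tables):
        res = int(base_res * growth ** level)
        pos = x * res
        lo = np.floor(pos).astype(np.int64)
        frac = pos - lo
        # Trilinear interpolation over the 8 surrounding grid corners.
        f = 0.0
        for corner in range(8):
            offset = np.array([(corner >> i) & 1 for i in range(3)])
            idx = hash_coords(lo + offset, table.shape[0])
            w = np.prod(np.where(offset, frac, 1 - frac), axis=-1)
            f = f + w[..., None] * table[idx]
        feats.append(f)
    return np.concatenate(feats, axis=-1)

rng = np.random.default_rng(0)
L, T, F = 4, 2**14, 2                 # levels, table size, features per level
tables = [rng.normal(0, 1e-4, (T, F)) for _ in range(L)]
pts = rng.random((5, 3))              # five random 3D query points
enc = hash_encode(pts, tables)
print(enc.shape)                      # (5, 8): L * F features per point
```

In the full method the tables are trainable parameters optimized jointly with a small MLP; here they are random, which suffices to show the encoding's shape and mechanics.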
Efficient video-language modeling must account for the computational cost of processing a large, sometimes intractable, number of video frames.
Ranked #7 on Video Question Answering on NExT-QA
Given an untrimmed video and a language query depicting a specific temporal moment in the video, video grounding aims to localize the time interval by understanding the text and video simultaneously.
Multi-domain Neural Machine Translation (NMT) trains a single model on data from multiple domains.
Specifically, we formulate a diffusion-based matching-and-generation framework that interleaves cross-domain matching and diffusion steps in the latent space by iteratively feeding the intermediate warp into the noising process and denoising it to generate a translated image.
Automatic deep learning-based examination of ECG signals can lead to inaccurate diagnoses, while manual analysis requires clinicians to reject noisy ECG samples, which costs extra time.
To mitigate this issue, we propose to incorporate an auxiliary point-selective network into a meta-learning framework, called PointFix, to provide a robust initialization of stereo models for online stereo adaptation.
Based on the recent trend of multimodal generative evaluations exploiting a vision-and-language pre-trained model, we propose the negative Gaussian cross-mutual information using CLIP features as a unified metric, coined Mutual Information Divergence (MID).
Ranked #1 on Human Judgment Classification on Pascal-50S
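The entry above only names the metric. As a rough illustration, mutual information between two feature sets can be estimated from covariances under a joint Gaussian assumption. The numpy sketch below is a simplification with illustrative names (the actual MID metric uses a cross-mutual-information formulation over CLIP embeddings, with further details in the paper):

```python
import numpy as np

def gaussian_mutual_information(X, Y, eps=1e-6):
    """Estimate I(X; Y) assuming (X, Y) is jointly Gaussian.
    X, Y: (n, d) feature matrices, e.g. image/text embeddings."""
    Z = np.concatenate([X, Y], axis=1)
    d = X.shape[1]
    def logdet(M):
        # Regularize for numerical stability before the log-determinant.
        return np.linalg.slogdet(M + eps * np.eye(M.shape[0]))[1]
    cov = np.cov(Z, rowvar=False)
    # I(X; Y) = 0.5 * (log det Sx + log det Sy - log det S_joint)
    return 0.5 * (logdet(cov[:d, :d]) + logdet(cov[d:, d:]) - logdet(cov))

rng = np.random.default_rng(0)
x = rng.normal(size=(2000, 4))
noise = rng.normal(size=(2000, 4))
aligned = gaussian_mutual_information(x, x + 0.1 * noise)
independent = gaussian_mutual_information(x, noise)
print(aligned > independent)  # correlated pairs carry more mutual information
```

The intuition carried over to evaluation: well-aligned image-caption pairs should yield higher mutual information between their features than mismatched ones.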
This paper presents Probabilistic Video Contrastive Learning, a self-supervised representation learning method that bridges contrastive learning with probabilistic representation.
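A minimal numpy sketch of the core idea above: represent each clip as a Gaussian, draw reparameterized samples, and apply an InfoNCE-style contrastive loss to the samples. The function names, sampling scheme, and hyperparameters are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sample_embeddings(mu, log_sigma, n_samples, rng):
    # Reparameterized samples z = mu + sigma * eps from each clip's Gaussian.
    eps = rng.normal(size=(n_samples,) + mu.shape)
    return mu + np.exp(log_sigma) * eps

def prob_contrastive_loss(mu_a, ls_a, mu_b, ls_b, tau=0.1, n_samples=8, rng=None):
    """Monte-Carlo InfoNCE between two views' Gaussian clip embeddings."""
    if rng is None:
        rng = np.random.default_rng(0)
    za = sample_embeddings(mu_a, ls_a, n_samples, rng)  # (S, N, D)
    zb = sample_embeddings(mu_b, ls_b, n_samples, rng)
    za /= np.linalg.norm(za, axis=-1, keepdims=True)
    zb /= np.linalg.norm(zb, axis=-1, keepdims=True)
    # Averaging cosine similarity over samples approximates the expectation.
    sim = np.einsum('snd,smd->nm', za, zb) / (n_samples * tau)
    log_p = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))  # positives sit on the diagonal

rng = np.random.default_rng(1)
mu_pos = rng.normal(size=(16, 8))
mu_neg = rng.normal(size=(16, 8))
loss_matched = prob_contrastive_loss(mu_pos, -3.0, mu_pos, -3.0)
loss_mismatched = prob_contrastive_loss(mu_pos, -3.0, mu_neg, -3.0)
print(loss_matched < loss_mismatched)
```

The learned variance gives each video an uncertainty estimate in embedding space, which is the point of bridging contrastive learning with probabilistic representations.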
We show that the proposed method produces visually diverse and plausible results in multiple domains compared to the state-of-the-art methods.
This is an exploratory study showing that current image quantization (vector quantization) does not satisfy translation equivariance in the quantized space due to aliasing.
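The aliasing claim above can be illustrated with a hypothetical toy 1D "encoder" (strided pooling followed by nearest-codeword lookup): shifting the input by the full stride shifts the code sequence by one, but a sub-stride shift generally does not correspond to any shift of the codes. This toy is an assumption for illustration, not the paper's experiment:

```python
import numpy as np

def encode(signal, codebook, stride=2):
    # Toy encoder: strided average pooling, then nearest-codeword quantization.
    pooled = signal.reshape(-1, stride).mean(axis=1)
    return np.argmin(np.abs(pooled[:, None] - codebook[None, :]), axis=1)

codebook = np.linspace(-1, 1, 8)
rng = np.random.default_rng(0)
x = rng.normal(size=64)
codes = encode(x, codebook)
codes_stride = encode(np.roll(x, 2), codebook)  # shift by the full stride
codes_sub = encode(np.roll(x, 1), codebook)     # sub-stride shift (aliasing)
print(np.array_equal(codes_stride, np.roll(codes, 1)))  # True: equivariant
print(np.array_equal(codes_sub, np.roll(codes, 1)))     # almost surely False
```

The sub-stride shift changes which samples are pooled together, so the quantized sequence is scrambled rather than shifted, which is exactly the aliasing-induced failure of translation equivariance.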
EHR systems lack a unified code system for representing medical concepts, which acts as a barrier to deploying deep learning models at scale across multiple clinics and hospitals.
Video prediction, forecasting the future frames from a sequence of input frames, is a challenging task since the view changes are influenced by various factors, such as the global context surrounding the scene and local motion dynamics.
Domain generalization aims to learn a prediction model on multi-domain source data such that the model can generalize to a target domain with unknown statistics.
To overcome this problem, we introduce Description-based Embedding, DescEmb, a code-agnostic description-based representation learning framework for predictive modeling on EHR.
The ability to perform causal and counterfactual reasoning is a central property of human intelligence.
As a result, our method can learn question-conditioned visual representations of appearance and motion that show strong performance on video question answering.
In this paper, we address the problem of separating individual speech signals from videos using audio-visual neural processing.
The goal of video summarization is to select keyframes that are visually diverse and represent the whole story of an input video.
We present deep networks for context-aware emotion recognition, called CAER-Net, that exploit not only human facial expression but also context information in a joint and boosting manner.
Ranked #7 on Emotion Recognition in Context on EMOTIC