In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models.
The proposed framework consists of a large language model (LLM), a diffusion-based image generator, and a set of carefully designed visual rewards.
The rich semantics are further treated as a semantic prior to guide the learning of the Diffusion Transformer, which produces the output sentence through a diffusion process.
Existing works attempt to solve the problem by explicitly imposing uncertainty on the classifier when it is exposed to OOD inputs during training.
In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, the COntrastive RElation (CORE) module, to mitigate this issue.
The BERT-style architecture has revolutionized vision-language pre-training and achieved state-of-the-art results on numerous downstream vision-language tasks.
Vision Transformers have shown great visual representation power on a wide range of vision tasks such as recognition and detection, and have thus attracted fast-growing efforts to manually design more effective architectures.
Given a sequence of style tokens, TokenGAN controls image synthesis by assigning the styles to the content tokens through the attention mechanism of a Transformer.
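The style-assignment idea above can be sketched as a single cross-attention step in which each content token gathers a weighted mixture of style tokens. This is a minimal NumPy illustration, not TokenGAN's actual implementation; the function name and shapes are assumptions for the example.

```python
import numpy as np

def assign_styles(content, styles):
    """Toy cross-attention: each content token attends over the style
    tokens (softmax of scaled dot products) and receives a convex
    mixture of them. Shapes: content (N, d), styles (M, d) -> (N, d)."""
    logits = content @ styles.T / np.sqrt(content.shape[1])  # (N, M) scores
    logits -= logits.max(axis=1, keepdims=True)              # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)            # softmax over styles
    return weights @ styles                                  # styled content tokens

rng = np.random.default_rng(0)
content = rng.normal(size=(4, 8))   # 4 content tokens, dim 8
styles = rng.normal(size=(3, 8))    # 3 style tokens, dim 8
out = assign_styles(content, styles)
```

In the real model the queries, keys, and values would be learned linear projections; the raw dot product here just keeps the sketch short.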
To solve the partial visual confusion issue, we propose to leverage the context information carried by a context reference, i.e., a larger concentric box around each region proposal, to perform more accurate region classification and regression.
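Constructing the concentric context box described above is a simple geometric operation: enlarge the proposal about its center and clip to the image. A minimal sketch, with an assumed `(x1, y1, x2, y2)` box convention and an illustrative `scale` parameter:

```python
def concentric_context_box(box, scale=2.0, img_w=None, img_h=None):
    """Enlarge an (x1, y1, x2, y2) proposal about its center by `scale`,
    optionally clipping the result to the image bounds."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0          # box center
    hw, hh = (x2 - x1) * scale / 2.0, (y2 - y1) * scale / 2.0
    nx1, ny1, nx2, ny2 = cx - hw, cy - hh, cx + hw, cy + hh
    if img_w is not None:                               # clip horizontally
        nx1, nx2 = max(0.0, nx1), min(float(img_w), nx2)
    if img_h is not None:                               # clip vertically
        ny1, ny2 = max(0.0, ny1), min(float(img_h), ny2)
    return (nx1, ny1, nx2, ny2)

print(concentric_context_box((10, 10, 30, 30), scale=2.0))  # (0.0, 0.0, 40.0, 40.0)
```

Features pooled from this larger box would then be concatenated with (or otherwise fused into) the proposal's own features before classification and regression.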
In this paper, we construct a novel probabilistic graphical model, referred to as LORAC, that effectively incorporates a low-rank-promoting prior into the framework of contrastive learning.
We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
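A common way to realize 2D relative position encodings like the ones described above is to map each pair of token positions to a bucket indexed by their `(dy, dx)` offset, and look up a learnable bias per bucket. This is a generic sketch of that bucketing, not the specific iRPE formulation (which also studies directed, undirected, and distance-based mappings):

```python
import numpy as np

def relative_position_index(h, w):
    """Assign every ordered pair of positions on an h-by-w token grid a
    bucket id determined only by their (dy, dx) offset."""
    coords = np.array([(y, x) for y in range(h) for x in range(w)])  # (h*w, 2)
    rel = coords[:, None, :] - coords[None, :, :]                    # (N, N, 2) offsets
    rel_y = rel[..., 0] + (h - 1)          # shift dy into [0, 2h-2]
    rel_x = rel[..., 1] + (w - 1)          # shift dx into [0, 2w-2]
    return rel_y * (2 * w - 1) + rel_x     # unique bucket per offset

idx = relative_position_index(2, 3)                # 6 tokens -> (6, 6) index map
num_buckets = (2 * 2 - 1) * (2 * 3 - 1)            # 15 distinct offsets
bias_table = np.zeros(num_buckets)                 # learnable scalar per offset (per head)
bias = bias_table[idx]                             # (6, 6) additive attention bias
```

In an attention layer, `bias` would be added to the query-key logits before the softmax, so that tokens at the same relative displacement always share one learned bias.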
First, we estimate full-body anthropometric parameters from limited user inputs via an imputation technique, so that the anthropometric parameters essential for 3D body reshaping can be obtained.
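One standard way to impute a full measurement vector from a few observed entries is conditional-mean imputation under a joint Gaussian model of the measurements. This is only an illustrative sketch of that generic technique, not the paper's specific method; `mean` and `cov` would be estimated from a body-measurement dataset.

```python
import numpy as np

def impute_gaussian(known_idx, known_vals, mean, cov):
    """Fill the missing entries of a measurement vector with their
    Gaussian conditional expectation given the observed entries."""
    n = len(mean)
    miss_idx = [i for i in range(n) if i not in known_idx]
    S_oo = cov[np.ix_(known_idx, known_idx)]   # observed-observed covariance
    S_mo = cov[np.ix_(miss_idx, known_idx)]    # missing-observed covariance
    delta = np.asarray(known_vals) - mean[known_idx]
    filled = mean.astype(float).copy()
    filled[known_idx] = known_vals
    # E[missing | observed] = mu_m + S_mo S_oo^{-1} (observed - mu_o)
    filled[miss_idx] = mean[miss_idx] + S_mo @ np.linalg.solve(S_oo, delta)
    return filled

mean = np.array([1.0, 2.0, 3.0])               # toy prior means of 3 measurements
filled = impute_gaussian([0], [5.0], mean, np.eye(3))
```

With an identity covariance (no correlation between measurements), the missing entries simply stay at their prior means; real anthropometric measurements are strongly correlated, which is what makes this kind of imputation useful.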
To improve texture synthesis, we enhance the discriminator of AOT-GAN by training it with a tailored mask-prediction task.
In this paper, we propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
To solve these problems, we propose an automated structure nomenclature standardization framework, 3D Non-local Network with Voting (3DNNV).
Interestingly, principal component analysis provides an effective way to define such a frame, i.e., by setting the principal components as the frame axes.
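The PCA-derived frame above can be computed in a few lines: center the points, take the eigenvectors of their covariance matrix, and order them by decreasing variance. A minimal NumPy sketch (the function name is illustrative; sign/orientation ambiguities of eigenvectors are left unresolved here):

```python
import numpy as np

def pca_frame(points):
    """Return a local coordinate frame for a point set: the principal
    components of the centered points, one axis per row, sorted by
    decreasing explained variance."""
    centered = points - points.mean(axis=0)
    cov = centered.T @ centered / len(points)      # sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)         # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]              # reorder: largest variance first
    return eigvecs[:, order].T                     # rows are the frame axes

rng = np.random.default_rng(0)
pts = rng.normal(size=(100, 3)) * np.array([5.0, 2.0, 0.5])  # anisotropic cloud
frame = pca_frame(pts)   # 3x3 orthonormal matrix
```

Because the covariance matrix rotates with the data, the resulting axes rotate with it too, which is what makes this frame useful for rotation-invariant processing.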
We study weakly-supervised object detection (WSOD), which plays a vital role in reducing human involvement in object-level annotation.
The problem of distance metric learning is mostly considered from the perspective of learning an embedding space in which the distances between pairs of examples correspond to a similarity metric.
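A concrete instance of this embedding-space view is the classic contrastive loss: similar pairs are pulled together in Euclidean distance, dissimilar pairs are pushed at least a margin apart. This sketch illustrates the general idea, not any specific paper's loss:

```python
import numpy as np

def pairwise_contrastive_loss(emb_a, emb_b, same, margin=1.0):
    """Contrastive loss over paired embeddings.
    emb_a, emb_b: (B, d) embedding batches; same: (B,) with 1.0 for
    similar pairs and 0.0 for dissimilar pairs."""
    d = np.linalg.norm(emb_a - emb_b, axis=1)            # Euclidean distances
    pos = same * d ** 2                                  # similar: shrink distance
    neg = (1 - same) * np.maximum(0.0, margin - d) ** 2  # dissimilar: enforce margin
    return float(np.mean(pos + neg))

a = np.array([[0.0, 0.0], [1.0, 0.0]])
b = np.array([[0.0, 0.0], [1.0, 0.0]])
labels = np.array([1.0, 0.0])   # first pair similar, second dissimilar
loss = pairwise_contrastive_loss(a, b, labels)
```

Minimizing such a loss shapes the embedding space so that nearest-neighbor distance directly reflects semantic similarity.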
Deep models are capable of fitting complex high-dimensional functions but usually incur a heavy computational load.
Moreover, the inherently recurrent dependency in RNNs prevents parallelization within a sequence during training and therefore limits computational efficiency.
Since the missing content can be filled by attention transfer from deep to shallow layers in a pyramid fashion, both the visual and semantic coherence of image inpainting can be ensured.
However, such devices provide only sparse (limited speckles in a structured-light system) and noisy 3D data, which cannot support face recognition directly.
In this paper, we consider a typical image blind denoising problem, which is to remove unknown noise from noisy images.
A valid question is how to temporally localize and then describe events, which is known as "dense video captioning."
However, we observe that directly feeding the hallucinated facial images into recognition models can even degrade recognition performance, despite the much better visualization quality.
In this paper, we propose a method to automatically and incrementally construct datasets from the massive weakly labeled data of the target domain that are readily available on the Internet, with the help of a pretrained face model.
In this paper, we propose an RGB-D camera localization approach that takes an effective geometric constraint, i.e., silhouette consistency, into consideration.
Identifying the same individual across different scenes is an important yet difficult task in intelligent video surveillance.
We present a novel global stereo model designed for view interpolation.