To overcome the limitation of local spatial attention, we propose a point content-based Transformer architecture, called PointConT for short.
Multi-domain image-to-image (I2I) translations can transform a source image according to the style of a target domain.
Generative Neural Radiance Fields (GNeRF) based 3D-aware GANs have demonstrated remarkable capabilities in generating high-quality images while maintaining strong 3D consistency.
In this paper, we propose a real-time face detector based on the one-stage detector YOLOv5, named YOLO-FaceV2.
In this paper, we explore solving jigsaw puzzles as a self-supervised auxiliary loss in ViT for image classification; we name the resulting model Jigsaw-ViT.
In this work, we propose a different and complementary direction, in which a local bias is introduced using an auxiliary self-supervised task, performed jointly with standard supervised training.
In particular, MJP first shuffles the selected patches via our block-wise random jigsaw puzzle shuffle algorithm and occludes their corresponding position embeddings (PEs).
MAGIC is a flexible framework and is theoretically compatible with any text generation tasks that incorporate image grounding.
Current practices in metric evaluation focus on a single dataset, e.g., the Newstest dataset in each year's WMT Metrics Shared Task.
The ISF manipulates the semantics of an input latent code so that the image generated from it lies in the desired visual domain.
In this paper, we propose a new training protocol based on three specific losses that help a translation network learn a smooth and disentangled latent style space in which: 1) both intra- and inter-domain interpolations correspond to gradual changes in the generated images, and 2) the content of the source image is better preserved during the translation.
This task encourages the VTs to learn spatial relations within an image and makes the VT training much more robust when training data are scarce.
An important aspect of developing dialogue systems is how to evaluate and compare the performance of different systems.
To address this issue, we propose an adversarial shape learning network (ASLNet) to model building shape patterns, thereby improving the accuracy of building segmentation.
Inspired by recent works on image inpainting, our proposed method leverages semantic segmentation to model the content and structure of the image, and learns the best shape and location of the object to insert.
In this paper, we propose the use of an image retrieval system to assist the image-to-image translation task.
Our proposed model disentangles the image content from the visual attributes, and it learns to modify the latter using the textual description, before generating a new image from the content and the modified attribute representation.
Unsupervised image-to-image translation (UNIT) aims at learning a mapping between several visual domains by using unpaired training images.
PIGAT introduces an attention mechanism that weighs the importance of each interacted user/item to both the user and the item, capturing user interests, item attractions, and their influence on the recommendation context.
In this work, we propose a novel GAN architecture that decouples the required annotations into a category label, which specifies the gesture type, and a simple-to-draw, category-independent conditional map, which expresses the location, rotation, and size of the hand gesture.
This paper presents a simple yet efficient algorithm for 3D line segment detection from large-scale unorganized point clouds.
We observe that in conversation tasks, each query can have multiple responses, which forms a 1-to-n or m-to-n relationship over the corpus as a whole.
We develop a novel deep contour detection algorithm with a top-down fully convolutional encoder-decoder network.