We introduce a novel ranking network that uses the Co-Attention between movies and trailers as guidance to generate the training pairs, where moments highly correlated with trailers are expected to be scored higher than uncorrelated moments.
Then, we extend the baseline model to a prompt-based learning approach, PromptMR, for all-in-one MRI reconstruction from different views, contrasts, adjacent types, and acceleration factors.
Accurate 3D shape abstraction from a single 2D image is a long-standing problem in computer vision and graphics.
The task of shape abstraction with semantic part consistency is challenging due to the complex geometries of natural objects.
However, current cardiac MRI-based reconstruction technology used in clinical settings is 2D with limited through-plane resolution, resulting in low-quality reconstructed cardiac volumes.
Second, a split-and-fusion (SAF) head is designed to remove noise in the localization of PLs, which is usually ignored by existing methods.
Accurate diagnosis of pathological subtypes of lung cancer is of significant importance for follow-up treatment and prognosis management.
Survival outcome assessment is challenging and inherently associated with multiple clinical factors (e.g., imaging and genomic biomarkers) in cancer.
Inspired by the training of medical residents, we explore universal medical image segmentation, whose goal is to learn from diverse medical imaging sources covering a range of clinical targets, body regions, and image modalities.
Tooth segmentation from intraoral scans is a crucial part of digital dentistry.
However, directly aligning cross-modal information using such representations is challenging, as visual patches and text tokens differ in semantic level and granularity.
The supervised learning of the proposed method extracts features from the limited labeled data at each client, while the unlabeled data is used to distill both feature-based and response-based knowledge from a national data repository, further improving the accuracy of the collaborative model and reducing the communication cost.
Prototypes, as representations of class embeddings, have been explored to reduce memory footprint or mitigate forgetting in continual learning scenarios.
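As a concrete illustration, a prototype is often simply the mean embedding of a class, and classification reduces to a nearest-prototype lookup. The sketch below is a minimal NumPy toy (all names and data are illustrative, not any specific continual-learning method):

```python
import numpy as np

def class_prototypes(features, labels):
    """Mean embedding per class -- a common definition of a prototype."""
    classes = np.unique(labels)
    return {c: features[labels == c].mean(axis=0) for c in classes}

def nearest_prototype(x, protos):
    """Classify a query by its nearest prototype in Euclidean distance."""
    return min(protos, key=lambda c: np.linalg.norm(x - protos[c]))

# Toy 2-D embeddings for two classes.
feats = np.array([[0.0, 0.0], [0.2, 0.0], [1.0, 1.0], [1.2, 1.0]])
labels = np.array([0, 0, 1, 1])
protos = class_prototypes(feats, labels)
print(nearest_prototype(np.array([0.1, 0.1]), protos))  # 0
```

Storing only one vector per class is what makes prototypes attractive for the memory-constrained continual setting described above.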
Summarizing the mesh cells/triangles in this manner imposes an implicit structural constraint and makes it difficult to work at multiple resolutions, as many point-cloud-based deep learning algorithms do.
First, a disentanglement network is proposed to decompose an image into a domain-invariant anatomical representation and a domain-specific style code; the former is sent to a segmentation model that is not affected by the domain shift, while the disentanglement network is regularized by a decoder that combines the anatomical and style codes to reconstruct the input image.
First, DePT plugs visual prompts into the vision Transformer and only tunes these source-initialized prompts during adaptation.
Despite the success of fully-supervised human skeleton sequence modeling, utilizing self-supervised pre-training for skeleton sequence representation learning has been an active field because acquiring task-specific skeleton annotations at large scales is difficult.
In this work, we propose self-supervised and weight-preserving neural architecture search (SSWP-NAS) as an extension of the current NAS framework, allowing self-supervision and retaining the concomitant weights discovered during the search stage.
Transformers have demonstrated remarkable performance in natural language processing and computer vision.
We propose a reinforcement learning based approach to query object localization, for which an agent is trained to localize objects of interest specified by a small exemplary set.
Magnetic Resonance (MR) image reconstruction from highly undersampled $k$-space data is critical in accelerated MR imaging (MRI) techniques.
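To make the undersampled-reconstruction setting concrete, the NumPy sketch below (purely illustrative; not a method from the literature) builds a toy image, undersamples its k-space with a random mask, and forms the zero-filled reconstruction that learned methods aim to improve on:

```python
import numpy as np

rng = np.random.default_rng(0)
image = rng.standard_normal((64, 64))          # toy "fully sampled" image

kspace = np.fft.fft2(image)                    # full k-space
mask = rng.random(kspace.shape) < 0.25         # keep ~25% of samples (~4x acceleration)
zero_filled = np.fft.ifft2(kspace * mask).real # naive zero-filled reconstruction

print(mask.mean())                             # fraction of sampled k-space points
# The zero-filled result exhibits aliasing; a reconstruction network's job
# is to recover `image` from `kspace * mask` and `mask`.
```

Without the mask, the inverse FFT recovers the image exactly, which is a useful sanity check when experimenting with sampling patterns.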
Deep learning-based medical image segmentation has shown the potential to reduce manual delineation efforts, but it still requires large-scale, finely annotated datasets for training, and large-scale datasets covering the whole abdominal region with accurate, detailed annotations for whole-abdomen organ segmentation are lacking.
In this paper, we attenuate this need by introducing an end-to-end SLT model that does not entail explicit use of glosses; the model only needs ground-truth text.
To handle this problem, we propose a hybrid supervision learning framework for such high-resolution images, using sufficient coarse image-level annotations and a few fine pixel-level labels.
To deal with this problem, in this paper, we propose an object-guided instance segmentation method.
Attention-based models, exemplified by the Transformer, can effectively model long-range dependencies, but they suffer from the quadratic complexity of the self-attention operation, making them difficult to adopt for high-resolution image generation with Generative Adversarial Networks (GANs).
Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities.
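At its core, cross-modal attention is scaled dot-product attention where queries come from one modality and keys/values from the other. A minimal NumPy sketch (toy shapes and random features; illustrative only, not a specific matching model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys, values):
    """Scaled dot-product attention from one modality onto another."""
    d = queries.shape[-1]
    weights = softmax(queries @ keys.T / np.sqrt(d))
    return weights @ values, weights

# Hypothetical toy features: 3 image regions attending over 4 text tokens.
regions = np.random.default_rng(1).standard_normal((3, 8))
tokens = np.random.default_rng(2).standard_normal((4, 8))
attended, w = cross_attention(regions, tokens, tokens)
print(attended.shape)  # (3, 8): each region summarized by relevant tokens
```

The attention weights `w` are exactly the fine-grained region-to-token relevance scores the sentence above refers to.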
We introduce a motion generator that discovers the desired trajectory, in which content and motion are disentangled.
To overcome these problems, we propose a 3D sphere representation-based center-points matching detection network that is anchor-free and automatically predicts the position, radius, and offset of nodules without the manual design of nodule/anchor parameters.
In the animation industry, cartoon videos are usually produced at a low frame rate, since hand-drawing such frames is costly and time-consuming.
Radiotherapy is a treatment where radiation is used to eliminate cancer cells.
Memory-efficient continuous Sign Language Translation is a significant challenge for the development of assistive technologies with real-time applicability for the deaf.
The importance of this subject lies in the amount of training data that artificial neural networks need to accurately identify and segment objects in images, and in the infeasibility of acquiring a sufficiently large dataset within the biomedical field.
CrossNorm exchanges styles between feature channels to perform style augmentation, diversifying the content and style mixtures.
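The style exchange can be sketched in a few lines: treat each channel's mean and standard deviation as its "style" and swap these statistics between two feature maps. The following is a minimal NumPy sketch of that idea (shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def crossnorm(a, b, eps=1e-5):
    """Swap channel-wise styles (mean/std) between feature maps a, b of shape (C, H, W)."""
    mu_a, sd_a = a.mean((1, 2), keepdims=True), a.std((1, 2), keepdims=True) + eps
    mu_b, sd_b = b.mean((1, 2), keepdims=True), b.std((1, 2), keepdims=True) + eps
    a_in_b = (a - mu_a) / sd_a * sd_b + mu_b  # a's content, b's style
    b_in_a = (b - mu_b) / sd_b * sd_a + mu_a  # b's content, a's style
    return a_in_b, b_in_a

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 8, 8)) * 3.0 + 1.0  # "styled" features
b = rng.standard_normal((4, 8, 8))
a2, b2 = crossnorm(a, b)
# a2 now carries b's channel statistics (and vice versa).
```

Because only first- and second-order statistics are exchanged, the spatial content of each feature map is preserved while its style is augmented.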
As deep learning technologies advance, increasing amounts of data are necessary to build general and robust models for various tasks.
To alleviate such tedious manual effort, in this paper we propose a novel weakly supervised segmentation framework based on partial points annotation, i.e., only a small portion of the nuclei locations in each image are labeled.
Our key idea is to generalize the distilled cross-modal knowledge learned from a Source dataset, which contains paired examples from both modalities, to the Target dataset by modeling knowledge as priors on parameters of the Student.
The comparison results demonstrate the merits of our method in both Cobb angle measurement and landmark detection on low-contrast and ambiguous X-ray images.
Along with the instance normalization, the model is able to recover the target object distribution and suppress the distribution of neighboring attached objects.
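For reference, instance normalization standardizes each channel of each sample over its spatial dimensions, which is what lets it re-center the target object's feature distribution. A generic NumPy sketch (not the paper's full model):

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Per-channel normalization over spatial dims; x has shape (C, H, W)."""
    mu = x.mean(axis=(1, 2), keepdims=True)
    sd = x.std(axis=(1, 2), keepdims=True)
    return (x - mu) / (sd + eps)

x = np.random.default_rng(0).standard_normal((3, 16, 16)) * 5.0 + 2.0
y = instance_norm(x)
# Each channel of y now has approximately zero mean and unit variance.
```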
Graph kernels are kernel methods measuring graph similarity and serve as a standard tool for graph classification.
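A minimal example of a graph kernel is the node-label histogram kernel, which is the base case the Weisfeiler-Lehman subtree kernel iterates on. A toy NumPy sketch (not any specific library's API):

```python
import numpy as np

def label_histogram_kernel(labels_g1, labels_g2, n_labels):
    """Simplest graph kernel: inner product of node-label histograms."""
    h1 = np.bincount(labels_g1, minlength=n_labels)
    h2 = np.bincount(labels_g2, minlength=n_labels)
    return int(h1 @ h2)

g1 = np.array([0, 0, 1, 2])   # node labels of graph 1
g2 = np.array([0, 1, 1, 1])   # node labels of graph 2
print(label_histogram_kernel(g1, g2, 3))  # 2*1 + 1*3 + 1*0 = 5
```

Because it is a valid positive semi-definite kernel, such a similarity can be plugged directly into an SVM for graph classification.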
However, effective and efficient delineation of all the knee articular cartilages in large-sized and high-resolution 3D MR knee data is still an open challenge.
In this paper, we propose a new box-based cell instance segmentation method.
We propose a Dynamic Graph-Based Spatial-Temporal Attention (DG-STA) method for hand gesture recognition.
We present a simple method that achieves unexpectedly strong performance on Visual Question Answering tasks that involve complex reasoning.
In this paper, we study the problem of learning Graph Convolutional Networks (GCNs) for regression.
Recent developments in gradient-based attention modeling have seen attention maps emerge as a powerful tool for interpreting convolutional neural networks.
Generating multi-view images from a single-view input is an essential yet challenging problem.
The ability for computational agents to reason about the high-level content of real world scene images is important for many applications.
We propose a novel method for real-time face alignment in videos based on a recurrent encoder-decoder network model.
We first recover the facial identity and expressions from the video by fitting a face morphable model for each frame.
It remains open to explore duality theory and algorithms in such a non-convex and NP-hard problem setting.
Multispectral pedestrian detection is essential for around-the-clock applications, e.g., surveillance and autonomous driving.
Our approach takes advantage of part-based representation and cascade regression for robust and efficient alignment on each frame.
In this paper, we propose a novel visual tracking framework that intelligently discovers reliable patterns from a wide range of video frames to resist drift error in long-term tracking tasks.
Face alignment, especially on real-time or large-scale sequential images, is a challenging task with broad applications.