no code implementations • 31 Dec 2024 • Jianjie Luo, Jingwen Chen, Yehao Li, Yingwei Pan, Jianlin Feng, Hongyang Chao, Ting Yao
Additionally, to facilitate the model training with synthetic data, a novel CLIP-weighted cross-entropy loss is devised to prioritize the high-quality image-text pairs over the low-quality counterparts.
1 code implementation • 12 Sep 2024 • Yifu Chen, Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Zhineng Chen, Tao Mei
In this paper, we propose to decompose the typical single-stage object inpainting into two cascaded processes: 1) semantic pre-inpainting that infers the semantic features of desired objects in a multi-modal feature space; 2) high-fieldity object generation in diffusion latent space that pivots on such inpainted semantic features.
1 code implementation • 12 Sep 2024 • Siqi Wan, Yehao Li, Jingwen Chen, Yingwei Pan, Ting Yao, Yang Cao, Tao Mei
To address this, we shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process with amplified guidance of both basic visual appearance and detailed textures (i. e., high-frequency details) derived from the given garment.
1 code implementation • 11 Sep 2024 • Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei
Despite having tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with high-resolution textures in detail, especially in the paradigm of 2D diffusion that lacks 3D awareness.
no code implementations • 11 Sep 2024 • Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Zuxuan Wu, Yu-Gang Jiang, Tao Mei
In the fine stage, DreamMesh jointly manipulates the mesh and refines the texture map, leading to high-quality triangle meshes with high-fidelity textured materials.
1 code implementation • 28 Jun 2024 • Jingtao Zhan, Qingyao Ai, Yiqun Liu, Yingwei Pan, Ting Yao, Jiaxin Mao, Shaoping Ma, Tao Mei
Such a prompt refinement process is analogous to translating the prompt from "user languages" into "system languages".
no code implementations • CVPR 2024 • Yurui Qian, Qi Cai, Yingwei Pan, Yehao Li, Ting Yao, Qibin Sun, Tao Mei
Diffusion models have recently brought a powerful revolution in image generation.
no code implementations • CVPR 2024 • Zhongwei Zhang, Fuchen Long, Yingwei Pan, Zhaofan Qiu, Ting Yao, Yang Cao, Tao Mei
Next, TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes image noise prior as the reference noise of each frame to amplify the alignment between the first frame and subsequent frames; 2) a residual path that employs 3D-UNet over noised video and static image latent codes to enable inter-frame relational reasoning, thereby easing the learning of the residual noise for each frame.
no code implementations • CVPR 2024 • Yang Chen, Yingwei Pan, Haibo Yang, Ting Yao, Tao Mei
In this work, we introduce a novel Visual Prompt-guided text-to-3D diffusion model (VP3D) that explicitly unleashes the visual appearance knowledge in 2D visual prompt to boost text-to-3D generation.
no code implementations • CVPR 2024 • Rui Zhu, Yingwei Pan, Yehao Li, Ting Yao, Zhenglong Sun, Tao Mei, Chang Wen Chen
Despite this progress, mask strategy still suffers from two inherent limitations: (a) training-inference discrepancy and (b) fuzzy relations between mask reconstruction & generative diffusion process, resulting in sub-optimal training of DiT.
no code implementations • 18 Mar 2024 • Ting Yao, Yehao Li, Yingwei Pan, Tao Mei
Instead, we present a new hybrid backbone with HIgh-Resolution Inputs (namely HIRI-ViT), that upgrades prevalent four-stage ViT to five-stage ViT tailored for high-resolution inputs.
no code implementations • 9 Nov 2023 • Yang Chen, Yingwei Pan, Yehao Li, Ting Yao, Tao Mei
In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of 3D scene parameterized as NeRF, encouraging each view of 3D scene aligned with the given text prompt and hand-drawn sketch.
no code implementations • 9 Nov 2023 • Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
To achieve this, we present a new diffusion model (ControlStyle) via upgrading a pre-trained text-to-image model with a trainable modulation network enabling more conditions of text prompts and style images.
1 code implementation • 9 Nov 2023 • Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Tao Mei
In this work, we propose a new 3DStyle-Diffusion model that triggers fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D Diffusion models.
no code implementations • CVPR 2023 • ZiCheng Zhang, Yinglu Liu, Congying Han, Yingwei Pan, Tiande Guo, Ting Yao
Simply coupling NeRF with photorealistic style transfer (PST) will result in cross-view inconsistency and degradation of stylized view syntheses.
no code implementations • CVPR 2023 • Sanqing Qu, Yingwei Pan, Guang Chen, Ting Yao, Changjun Jiang, Tao Mei
We validate the superiority of our MAD in a variety of single-DG scenarios with different modalities, including recognition on 1D texts, 2D images, 3D point clouds, and semantic segmentation on 2D images.
no code implementations • ICCV 2023 • Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Ting Yao, Tao Mei
Then, we build the geometric correspondence between 2D planes and 3D meshes by rasterization, and project the estimated object regions into 3D explicit object surfaces by aggregating the object information across multiple views.
no code implementations • ICCV 2023 • Qi Cai, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei
Recent progress on multi-modal 3D object detection has featured BEV (Bird-Eye-View) based fusion, which effectively unifies both LiDAR point clouds and camera images in a shared BEV space.
1 code implementation • CVPR 2023 • Yong Zhang, Yingwei Pan, Ting Yao, Rui Huang, Tao Mei, Chang-Wen Chen
Specifically, cheap scene graph supervision data can be easily obtained by parsing image language descriptions into semantic graphs.
no code implementations • CVPR 2023 • Ting Yao, Yehao Li, Yingwei Pan, Tao Mei
Next, as every two neighbor edges compose a surface, we obtain the edge-level representation of each anchor edge via surface-to-edge aggregation over all neighbor surfaces.
1 code implementation • CVPR 2023 • Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Jianlin Feng, Hongyang Chao, Tao Mei
The rich semantics are further regarded as semantic prior to trigger the learning of Diffusion Transformer, which produces the output sentence in a diffusion process.
1 code implementation • 15 Nov 2022 • Qi Cai, Yingwei Pan, Ting Yao, Tao Mei
Recent progress on 2D object detection has featured Cascade RCNN, which capitalizes on a sequence of cascade detectors to progressively improve proposal quality, towards high-quality object detection.
1 code implementation • 15 Nov 2022 • Zhaofan Qiu, Yehao Li, Yu Wang, Yingwei Pan, Ting Yao, Tao Mei
In this paper, we propose a novel deep architecture tailored for 3D point cloud applications, named as SPE-Net.
1 code implementation • 15 Nov 2022 • Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei
The pre-determined kernel size severely limits the temporal receptive fields and the fixed weights treat each spatial location across frames equally, resulting in sub-optimal solution for long-range temporal modeling in natural scenes.
1 code implementation • 26 Sep 2022 • Jingyang Lin, Yu Wang, Qi Cai, Yingwei Pan, Ting Yao, Hongyang Chao, Tao Mei
Existing works attempt to solve the problem by explicitly imposing uncertainty on classifiers when OOD inputs are exposed to the classifier during training.
2 code implementations • 11 Jul 2022 • Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei
Motivated by the wavelet theory, we construct a new Wavelet Vision Transformer (\textbf{Wave-ViT}) that formulates the invertible down-sampling with wavelet transforms and self-attention learning in a unified way.
Ranked #226 on Image Classification on ImageNet
1 code implementation • 11 Jul 2022 • Ting Yao, Yehao Li, Yingwei Pan, Yu Wang, Xiao-Ping Zhang, Tao Mei
Dual-ViT is henceforth able to reduce the computational complexity without compromising much accuracy.
1 code implementation • CVPR 2022 • Yehao Li, Yingwei Pan, Ting Yao, Tao Mei
In this paper, we propose a new recipe of Transformer-style structure, namely Comprehending and Ordering Semantics Networks (COS-Net), that novelly unifies an enriched semantic comprehending and a learnable semantic ordering processes into a single architecture.
1 code implementation • CVPR 2022 • Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Jiebo Luo, Tao Mei
In this paper, we present a new recipe of inter-frame attention block, namely Stand-alone Inter-Frame Attention (SIFA), that novelly delves into the deformation across frames to estimate local self-attention on each spatial location.
Ranked #14 on Action Recognition on Something-Something V1
1 code implementation • CVPR 2022 • Yong Zhang, Yingwei Pan, Ting Yao, Rui Huang, Tao Mei, Chang-Wen Chen
Such design decomposes the process of HOI set prediction into two subsequent phases, i. e., an interaction proposal generation is first performed, and then followed by transforming the non-parametric interaction proposals into HOI predictions via a structure-aware Transformer.
Ranked #3 on Human-Object Interaction Detection on V-COCO
1 code implementation • 13 Jun 2022 • Yingwei Pan, Yehao Li, Yiheng Zhang, Qi Cai, Fuchen Long, Zhaofan Qiu, Ting Yao, Tao Mei
This paper presents an overview and comparative analysis of our systems designed for the following two tracks in SAPIEN ManiSkill Challenge 2021: No Interaction Track: The No Interaction track targets for learning policies from pre-collected demonstration trajectories.
no code implementations • 26 Mar 2022 • Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han, Zhenghao Liu, Ning Ding, Yongming Rao, Yizhao Gao, Liang Zhang, Ming Ding, Cong Fang, Yisen Wang, Mingsheng Long, Jing Zhang, Yinpeng Dong, Tianyu Pang, Peng Cui, Lingxiao Huang, Zheng Liang, HuaWei Shen, HUI ZHANG, Quanshi Zhang, Qingxiu Dong, Zhixing Tan, Mingxuan Wang, Shuo Wang, Long Zhou, Haoran Li, Junwei Bao, Yingwei Pan, Weinan Zhang, Zhou Yu, Rui Yan, Chence Shi, Minghao Xu, Zuobai Zhang, Guoqiang Wang, Xiang Pan, Mengjie Li, Xiaoyu Chu, Zijun Yao, Fangwei Zhu, Shulin Cao, Weicheng Xue, Zixuan Ma, Zhengyan Zhang, Shengding Hu, Yujia Qin, Chaojun Xiao, Zheni Zeng, Ganqu Cui, Weize Chen, Weilin Zhao, Yuan YAO, Peng Li, Wenzhao Zheng, Wenliang Zhao, Ziyi Wang, Borui Zhang, Nanyi Fei, Anwen Hu, Zenan Ling, Haoyang Li, Boxi Cao, Xianpei Han, Weidong Zhan, Baobao Chang, Hao Sun, Jiawen Deng, Chujie Zheng, Juanzi Li, Lei Hou, Xigang Cao, Jidong Zhai, Zhiyuan Liu, Maosong Sun, Jiwen Lu, Zhiwu Lu, Qin Jin, Ruihua Song, Ji-Rong Wen, Zhouchen Lin, LiWei Wang, Hang Su, Jun Zhu, Zhifang Sui, Jiajun Zhang, Yang Liu, Xiaodong He, Minlie Huang, Jian Tang, Jie Tang
With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks becomes a popular paradigm.
no code implementations • CVPR 2021 • Dong Li, Zhaofan Qiu, Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei
For each action category, we execute online clustering to decompose the graph into sub-graphs on each scale through learning Gaussian Mixture Layer and select the discriminative sub-graphs as action prototypes for recognition.
no code implementations • 11 Jan 2022 • Yehao Li, Jiahao Fan, Yingwei Pan, Ting Yao, Weiyao Lin, Tao Mei
Vision-language pre-training has been an emerging and fast-developing research topic, which transfers multi-modal knowledge from rich-resource pre-training task to limited-resource downstream tasks.
no code implementations • 11 Jan 2022 • Yingwei Pan, Yue Chen, Qian Bao, Ning Zhang, Ting Yao, Jingen Liu, Tao Mei
To our best knowledge, our system is the first end-to-end automated directing system for multi-camera sports broadcasting, completely driven by the semantic understanding of sports events.
1 code implementation • NeurIPS 2021 • Yu Wang, Jingyang Lin, Jingjing Zou, Yingwei Pan, Ting Yao, Tao Mei
Our work reveals a structured shortcoming of the existing mainstream self-supervised learning methods.
no code implementations • 14 Dec 2021 • Yang Chen, Yingwei Pan, Yu Wang, Ting Yao, Xinmei Tian, Tao Mei
From this point, we present a particular paradigm of self-supervised learning tailored for domain adaptation, i. e., Transferrable Contrastive Learning (TCL), which links the SSL and the desired cross-domain transferability congruently.
1 code implementation • 14 Dec 2021 • Jingyang Lin, Yingwei Pan, Rongfeng Lai, Xuehang Yang, Hongyang Chao, Ting Yao
In this work, we quantitatively analyze the sub-text problem and present a simple yet effective design, COntrastive RElation (CORE) module, to mitigate that issue.
no code implementations • 14 Dec 2021 • Jianjie Luo, Yehao Li, Yingwei Pan, Ting Yao, Hongyang Chao, Tao Mei
BERT-type structure has led to the revolution of vision-language pre-training and the achievement of state-of-the-art results on numerous vision-language downstream tasks.
no code implementations • ICCV 2021 • Yang Chen, Yu Wang, Yingwei Pan, Ting Yao, Xinmei Tian, Tao Mei
Correspondingly, we also propose a novel "jury" mechanism, which is particularly effective in learning useful semantic feature commonalities among domains.
Ranked #45 on Domain Generalization on PACS
2 code implementations • 18 Aug 2021 • Yehao Li, Yingwei Pan, Jingwen Chen, Ting Yao, Tao Mei
Nevertheless, there has not been an open-source codebase in support of training and deploying numerous neural network models for cross-modal analytics in a unified and modular fashion.
no code implementations • 5 Aug 2021 • Yu Wang, Jingyang Lin, Qi Cai, Yingwei Pan, Ting Yao, Hongyang Chao, Tao Mei
In this paper, we construct a novel probabilistic graphical model that effectively incorporates the low rank promoting prior into the framework of contrastive learning, referred to as LORAC.
7 code implementations • 26 Jul 2021 • Yehao Li, Ting Yao, Yingwei Pan, Tao Mei
Such design fully capitalizes on the contextual information among input keys to guide the learning of dynamic attention matrix and thus strengthens the capacity of visual representation.
Ranked #309 on Image Classification on ImageNet
1 code implementation • 27 Jan 2021 • Yehao Li, Yingwei Pan, Ting Yao, Jingwen Chen, Tao Mei
Despite having impressive vision-language (VL) pretraining with BERT-based encoder for VL understanding, the pretraining of a universal encoder-decoder for both VL understanding and generation remains challenging.
1 code implementation • NeurIPS 2020 • Qi Cai, Yu Wang, Yingwei Pan, Ting Yao, Tao Mei
This paper explores useful modifications of the recent development in contrastive learning via novel probabilistic modeling.
3 code implementations • 3 Aug 2020 • Ting Yao, Yiheng Zhang, Zhaofan Qiu, Yingwei Pan, Tao Mei
In this paper, we compose a trilogy of exploring the basic and generic supervision in the sequence from spatial, spatiotemporal and sequential perspectives.
no code implementations • 27 Jul 2020 • Yingwei Pan, Jun Xu, Yehao Li, Ting Yao, Tao Mei
The Pre-training for Video Captioning Challenge 2020 Summary: results and challenge participants' technical reports.
1 code implementation • 7 Jul 2020 • Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, Tao Mei
Single shot detectors that are potentially faster and simpler than two-stage detectors tend to be more applicable to object detection in videos.
no code implementations • 5 Jul 2020 • Yingwei Pan, Yehao Li, Jianjie Luo, Jun Xu, Ting Yao, Tao Mei
In this work, we present Auto-captions on GIF, which is a new large-scale pre-training dataset for generic video understanding.
no code implementations • CVPR 2020 • Yingwei Pan, Ting Yao, Yehao Li, Chong-Wah Ngo, Tao Mei
A clustering branch is capitalized on to ensure that the learnt representation preserves such underlying structure by matching the estimated assignment distribution over clusters to the inherent cluster distribution for each target sample.
1 code implementation • CVPR 2020 • Qi Cai, Yingwei Pan, Yu Wang, Jingen Liu, Ting Yao, Tao Mei
To this end, we devise a general loss function to cover most region-based object detectors with various sampling strategies, and then based on it we propose a unified sample weighting network to predict a sample's task weights.
2 code implementations • CVPR 2020 • Yingwei Pan, Ting Yao, Yehao Li, Tao Mei
Recent progress on fine-grained visual recognition and visual question answering has featured Bilinear Pooling, which effectively models the 2$^{nd}$ order interactions across multi-modal inputs.
Ranked #21 on Image Captioning on COCO Captions
2 code implementations • 8 Oct 2019 • Yingwei Pan, Yehao Li, Qi Cai, Yang Chen, Ting Yao
Semi-Supervised Domain Adaptation: For this task, we adopt a standard self-learning framework to construct a classifier based on the labeled source and target data, and generate the pseudo labels for unlabeled target data.
no code implementations • 9 Sep 2019 • Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei
The problem of distance metric learning is mostly considered from the perspective of learning an embedding space, where the distances between pairs of examples are in correspondence with a similarity metric.
no code implementations • ICCV 2019 • Ting Yao, Yingwei Pan, Yehao Li, Tao Mei
It is always well believed that parsing an image into constituent visual patterns would be helpful for understanding and representing an image.
no code implementations • 26 Aug 2019 • Yang Chen, Yingwei Pan, Ting Yao, Xinmei Tian, Tao Mei
Unsupervised image-to-image translation is the task of translating an image from one domain to another in the absence of any paired training examples and tends to be more applicable to practical applications.
2 code implementations • ICCV 2019 • Jiajun Deng, Yingwei Pan, Ting Yao, Wengang Zhou, Houqiang Li, Tao Mei
In this paper, we introduce a new design to capture the interactions across the objects in spatio-temporal context.
1 code implementation • 16 Aug 2019 • Jianhao Zhang, Yingwei Pan, Ting Yao, He Zhao, Tao Mei
It is always well believed that Binary Neural Networks (BNNs) could drastically accelerate the inference efficiency by replacing the arithmetic operations in float-valued Deep Neural Networks (DNNs) with bit-wise operations.
no code implementations • 1 Aug 2019 • Jing Wang, Yingwei Pan, Ting Yao, Jinhui Tang, Tao Mei
A valid question is how to encapsulate such gists/topics that are worthy of mention from an image, and then describe the image from one topic to another but holistically with a coherent structure.
no code implementations • 20 Jun 2019 • Fuchen Long, Qi Cai, Zhaofan Qiu, Zhijian Hou, Yingwei Pan, Ting Yao, Chong-Wah Ngo
This notebook paper presents an overview and comparative analysis of our system designed for activity detection in extended videos (ActEV-PC) in ActivityNet Challenge 2019.
no code implementations • 14 Jun 2019 • Zhaofan Qiu, Dong Li, Yehao Li, Qi Cai, Yingwei Pan, Ting Yao
This notebook paper presents an overview and comparative analysis of our systems designed for the following three tasks in ActivityNet Challenge 2019: trimmed action recognition, dense-captioning events in videos, and spatio-temporal action localization.
1 code implementation • 3 May 2019 • Jingwen Chen, Yingwei Pan, Yehao Li, Ting Yao, Hongyang Chao, Tao Mei
Moreover, the inherently recurrent dependency in RNN prevents parallelization within a sequence during training and therefore limits the computations.
1 code implementation • CVPR 2019 • Qi Cai, Yingwei Pan, Chong-Wah Ngo, Xinmei Tian, Ling-Yu Duan, Ting Yao
The whole architecture is then optimized with three consistency regularizations: 1) region-level consistency to align the region-level predictions between teacher and student, 2) inter-graph consistency for matching the graph structures between teacher and student, and 3) intra-graph consistency to enhance the similarity between regions of same class within the graph of student.
no code implementations • CVPR 2019 • Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei
Image captioning has received significant attention with remarkable improvements in recent advances.
no code implementations • CVPR 2019 • Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, Tao Mei
Specifically, we present Transferrable Prototypical Networks (TPN) for adaptation such that the prototypes for each class in source and target domains are close in the embedding space and the score distributions predicted by prototypes separately on source and target data are similar.
no code implementations • ECCV 2018 • Ting Yao, Yingwei Pan, Yehao Li, Tao Mei
Technically, we build graphs over the detected objects in an image based on their spatial and semantic connections.
no code implementations • CVPR 2018 • Qi Cai, Yingwei Pan, Ting Yao, Chenggang Yan, Tao Mei
In this paper, we introduce the new ideas of augmenting Convolutional Neural Networks (CNNs) with Memory and learning to learn the network parameters for the unlabelled images on the fly in one-shot learning.
no code implementations • 23 Apr 2018 • Yingwei Pan, Zhaofan Qiu, Ting Yao, Houqiang Li, Tao Mei
In this paper, we present a novel Temporal GANs conditioning on Captions, namely TGANs-C, in which the input to the generator network is a concatenation of a latent noise vector and caption embedding, and then is transformed into a frame sequence with 3D spatio-temporal convolutions.
no code implementations • 23 Apr 2018 • Zhaofan Qiu, Yingwei Pan, Ting Yao, Tao Mei
Specifically, a novel deep semantic hashing with GANs (DSH-GANs) is presented, which mainly consists of four components: a deep convolution neural networks (CNN) for learning image representations, an adversary stream to distinguish synthetic images from real ones, a hash stream for encoding image representations to hash codes and a classification stream.
no code implementations • CVPR 2018 • Yehao Li, Ting Yao, Yingwei Pan, Hongyang Chao, Tao Mei
A valid question is how to temporally localize and then describe events, which is known as "dense video captioning."
no code implementations • CVPR 2017 • Ting Yao, Yingwei Pan, Yehao Li, Tao Mei
Image captioning often requires a large set of training image-sentence pairs.
no code implementations • CVPR 2017 • Yingwei Pan, Ting Yao, Houqiang Li, Tao Mei
Automatically generating natural language descriptions of videos plays a fundamental challenge for computer vision community.
no code implementations • ICCV 2017 • Ting Yao, Yingwei Pan, Yehao Li, Zhaofan Qiu, Tao Mei
Automatically describing an image with a natural language has been an emerging challenge in both fields of computer vision and natural language processing.
no code implementations • CVPR 2015 • Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, Tao Mei
In many real-world applications, we are often facing the problem of cross domain learning, i. e., to borrow the labeled data or transfer the already learnt knowledge from a source domain to a target domain.
no code implementations • CVPR 2016 • Yingwei Pan, Tao Mei, Ting Yao, Houqiang Li, Yong Rui
Our proposed LSTM-E consists of three components: a 2-D and/or 3-D deep convolutional neural networks for learning powerful video representation, a deep RNN for generating sentences, and a joint embedding model for exploring the relationships between visual content and sentence semantics.