no code implementations • ECCV 2020 • Weitao Wan, Jiansheng Chen, Ming-Hsuan Yang
We refer to this new robust training strategy as the adversarial training with bi-directional likelihood regularization (ATBLR) method.
no code implementations • ECCV 2020 • Chun-Han Yao, Chen Fang, Xiaohui Shen, Yangyue Wan, Ming-Hsuan Yang
While single-image object detectors can be naively applied to videos in a frame-by-frame fashion, the prediction is often temporally inconsistent.
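To make the inconsistency concrete, here is a hypothetical toy baseline (not this paper's method): per-frame confidences for the same object often flicker, and even a simple exponential moving average damps that flicker. The function name and values below are illustrative only.

```python
# Hypothetical illustration, not the paper's method: per-frame detector
# confidences for the same object often flicker; an exponential moving
# average (EMA) over time is the simplest smoothing baseline.
def smooth_scores(frame_scores, alpha=0.6):
    """Temporally smooth per-frame confidence scores with an EMA."""
    smoothed, running = [], None
    for s in frame_scores:
        running = s if running is None else alpha * running + (1 - alpha) * s
        smoothed.append(round(running, 3))
    return smoothed

# A flickering detection sequence becomes far more temporally consistent:
print(smooth_scores([0.9, 0.2, 0.85, 0.15, 0.9]))
```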
1 code implementation • 27 Nov 2024 • Li-Yuan Tsao, Hao-Wei Chen, Hao-Wei Chung, Deqing Sun, Chun-Yi Lee, Kelvin C. K. Chan, Ming-Hsuan Yang
Text-to-image diffusion models have emerged as powerful priors for real-world image super-resolution (Real-ISR).
no code implementations • 27 Nov 2024 • Yawei Li, Bin Ren, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Nicu Sebe, Ming-Hsuan Yang, Luca Benini
To strike a balance between efficiency and model capacity for a generalized transformer-based IR method, we propose a hierarchical information flow mechanism for image restoration, dubbed Hi-IR, which progressively propagates information among pixels in a bottom-up manner.
Ranked #1 on Image Deblurring on HIDE (trained on GOPRO)
1 code implementation • 26 Nov 2024 • Zhongyu Xia, Jishuo Li, Zhiwei Lin, Xinhao Wang, Yongtao Wang, Ming-Hsuan Yang
Moreover, we propose a vision-centric 3D open-world object detection baseline and further introduce an ensemble method by fusing general and specialized models to address the issue of lower precision in existing open-world methods for the OpenAD benchmark.
no code implementations • 26 Nov 2024 • Chanyoung Kim, Dayun Ju, Woojung Han, Ming-Hsuan Yang, Seong Jae Hwang
In this work, we introduce a novel approach that overcomes this limitation by incorporating object-level contextual knowledge within images.
no code implementations • Neural Information Processing Systems 2024 • Nitesh Bharadwaj Gundavarapu, Luke Friedman, Raghav Goyal, Chaitra Hegde, Eirikur Agustsson, Sagar M. Waghmare, Mikhail Sirotenko, Ming-Hsuan Yang, Tobias Weyand, Boqing Gong, Leonid Sigal
Nevertheless, most prior works that leverage MAE pre-training have focused on relatively short video representations (16/32 frames in length), largely because hardware memory and compute requirements scale poorly with video length under dense, memory-intensive self-attention decoding.
Ranked #1 on Action Recognition on Diving-48 (using extra training data)
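The memory argument is easy to verify with a back-of-envelope calculation; the patch counts below (14x14 tokens per frame) are an assumption for illustration, not the paper's configuration.

```python
# Why dense self-attention scales poorly with video length: the attention
# matrix grows quadratically in the total token count. Patch counts here
# are assumed for illustration (224x224 frames with 16x16 patches).
patches_per_frame = 14 * 14
for frames in (16, 32, 128):
    tokens = frames * patches_per_frame
    attn_entries = tokens ** 2   # one N x N matrix per head, per layer
    print(f"{frames:>4} frames -> {tokens:>6} tokens, "
          f"{attn_entries / 1e9:.2f}B attention entries per head/layer")
```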
1 code implementation • 6 Nov 2024 • Siddharth Seth, Rishabh Dabral, Diogo Luvizon, Marc Habermann, Ming-Hsuan Yang, Christian Theobalt, Adam Kortylewski
We motivate our work by noting that most methods require a parametric model of the human body to ground pose-dependent deformations.
1 code implementation • 31 Oct 2024 • Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, Songyou Peng
We utilize the reconstructed 3D Gaussians for novel view synthesis and pose estimation tasks and propose a two-stage coarse-to-fine pipeline for accurate pose estimation.
no code implementations • 20 Oct 2024 • Junwei Zhou, Xueting Li, Lu Qi, Ming-Hsuan Yang
Starting with a 2D layout provided by a user or generated from a text description, we first create a coarse 3D scene using a carefully designed initialization process based on efficient reconstruction models.
no code implementations • 16 Oct 2024 • Lichang Chen, Hexiang Hu, Mingda Zhang, YiWen Chen, Zifeng Wang, Yandong Li, Pranav Shyam, Tianyi Zhou, Heng Huang, Ming-Hsuan Yang, Boqing Gong
To address this, OmnixR offers two evaluation variants: (1) a synthetic subset, generated automatically by translating text into multiple modalities, namely audio, images, video, and hybrids (Omnify).
no code implementations • 15 Oct 2024 • Xirui Li, Charles Herrmann, Kelvin C. K. Chan, Yinxiao Li, Deqing Sun, Chao Ma, Ming-Hsuan Yang
Recent progress in image generation has sparked research into controlling these models through condition signals, with various methods addressing specific challenges in conditional generation.
no code implementations • 15 Oct 2024 • Hsin-Ping Huang, Xinyi Wang, Yonatan Bitton, Hagai Taitelbaum, Gaurav Singh Tomar, Ming-Wei Chang, Xuhui Jia, Kelvin C. K. Chan, Hexiang Hu, Yu-Chuan Su, Ming-Hsuan Yang
Using KITTEN, we conduct a systematic study on the fidelity of entities in text-to-image generation models, focusing on their ability to generate a wide range of real-world visual entities, such as landmark buildings, aircraft, plants, and animals.
1 code implementation • 14 Oct 2024 • Jingzhi Bao, Xueting Li, Ming-Hsuan Yang
In this work, we present Tex4D, a zero-shot approach that integrates inherent 3D geometry knowledge from mesh sequences with the expressiveness of video diffusion models to produce multi-view and temporally consistent 4D textures.
1 code implementation • 7 Oct 2024 • Yujin Tang, Lu Qi, Fei Xie, Xiangtai Li, Chao Ma, Ming-Hsuan Yang
On Moving MNIST, PredFormer achieves a 51.3% reduction in MSE relative to SimVP.
Ranked #1 on Video Prediction on Moving MNIST
no code implementations • 4 Oct 2024 • Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, Ming-Hsuan Yang
Estimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision.
no code implementations • 9 Sep 2024 • Henghui Ding, Lingyi Hong, Chang Liu, Ning Xu, Linjie Yang, Yuchen Fan, Deshui Miao, Yameng Gu, Xin Li, Zhenyu He, YaoWei Wang, Ming-Hsuan Yang, Jinming Chai, Qin Ma, Junpei Zhang, Licheng Jiao, Fang Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Xu Liu, Lingling Li, Hao Fang, Feiyu Pan, Xiankai Lu, Wei Zhang, Runmin Cong, Tuyen Tran, Bin Cao, Yisi Zhang, Hanyi Wang, Xingjian He, Jing Liu
Despite the promising performance of current video segmentation models on existing benchmarks, these models still struggle with complex scenes.
no code implementations • 29 Aug 2024 • Deshui Miao, Yameng Gu, Xin Li, Zhenyu He, YaoWei Wang, Ming-Hsuan Yang
Video object segmentation (VOS) is a crucial task in computer vision, but current VOS methods struggle with complex scenes and prolonged object motions.
2 code implementations • 17 Aug 2024 • Xin Lin, Yuyan Zhou, Jingtong Yue, Chao Ren, Kelvin C. K. Chan, Lu Qi, Ming-Hsuan Yang
As SE increases computational complexity during inference, we propose adding a re-boosting module to SC (Reb-SC), which further improves the SC strategy by incorporating SE into SC without increasing inference time.
no code implementations • 14 Aug 2024 • Seung Hyun Lee, Junjie Ke, Yinxiao Li, Junfeng He, Steven Hickson, Katie Datsenko, Sangpil Kim, Ming-Hsuan Yang, Irfan Essa, Feng Yang
The goal of image cropping is to identify visually appealing crops within an image.
no code implementations • 28 Jul 2024 • Shilin Xu, Xiangtai Li, Haobo Yuan, Lu Qi, Yunhai Tong, Ming-Hsuan Yang
The recent surge in Multimodal Large Language Models (MLLMs) has showcased their remarkable potential for achieving generalized intelligence by integrating visual understanding into Large Language Models. Nevertheless, the sheer model size of MLLMs leads to substantial memory and computational demands that hinder their widespread deployment.
no code implementations • 10 Jul 2024 • Xin Li, Deshui Miao, Zhenyu He, YaoWei Wang, Huchuan Lu, Ming-Hsuan Yang
Tracking and segmenting multiple similar objects with complex or separate parts in long-term videos is inherently challenging due to the ambiguity of target parts and identity confusion caused by occlusion, background clutter, and long-term variations.
1 code implementation • 9 Jul 2024 • Shuangkang Fang, Yufeng Wang, Yi-Hsuan Tsai, Yi Yang, Wenrui Ding, Shuchang Zhou, Ming-Hsuan Yang
Recent work on image content manipulation based on vision-language pre-training models has been effectively extended to text-driven 3D scene editing.
1 code implementation • 27 Jun 2024 • Haobo Yuan, Xiangtai Li, Lu Qi, Tao Zhang, Ming-Hsuan Yang, Shuicheng Yan, Chen Change Loy
Based on the benchmark results, our RWKV-SAM achieves outstanding performance in efficiency and segmentation quality compared to transformers and other linear attention models.
2 code implementations • 24 Jun 2024 • Henghui Ding, Chang Liu, Yunchao Wei, Nikhila Ravi, Shuting He, Song Bai, Philip Torr, Deshui Miao, Xin Li, Zhenyu He, YaoWei Wang, Ming-Hsuan Yang, Zhensong Xu, Jiangtao Yao, Chengjing Wu, Ting Liu, Luoqi Liu, Xinyu Liu, Jing Zhang, Kexin Zhang, Yuting Yang, Licheng Jiao, Shuyuan Yang, Mingqi Gao, Jingnan Luo, Jinyu Yang, Jungong Han, Feng Zheng, Bin Cao, Yisi Zhang, Xuanxu Lin, Xingjian He, Bo Zhao, Jing Liu, Feiyu Pan, Hao Fang, Xiankai Lu
Moreover, we provide MeViS, a new motion-expression-guided video segmentation dataset, to study natural language-guided video understanding in complex environments.
no code implementations • 7 Jun 2024 • Deshui Miao, Xin Li, Zhenyu He, YaoWei Wang, Ming-Hsuan Yang
In this challenge, we propose a semantic embedding video object segmentation model and use the salient features of objects as query representations.
1 code implementation • 30 May 2024 • Chaoyang Wang, Xiangtai Li, Lu Qi, Henghui Ding, Yunhai Tong, Ming-Hsuan Yang
For image synthesis, we propose a finite perturbation approach to enhance the diversity of generated results without changing the semantic categories.
no code implementations • 30 May 2024 • Bin Ren, Yawei Li, Jingyun Liang, Rakesh Ranjan, Mengyuan Liu, Rita Cucchiara, Luc van Gool, Ming-Hsuan Yang, Nicu Sebe
Additionally, for IR, it is commonly noted that small, semantically similar segments of a degraded image provide especially relevant information for restoration, as they contribute contextual cues crucial for accurate reconstruction.
1 code implementation • 27 May 2024 • Kuan-Chih Huang, Xiangtai Li, Lu Qi, Shuicheng Yan, Ming-Hsuan Yang
This foundational estimation facilitates a detailed, coarse-to-fine segmentation strategy that significantly enhances the precision of object identification and segmentation.
1 code implementation • 23 May 2024 • Lingshun Kong, Jiangxin Dong, Ming-Hsuan Yang, Jinshan Pan
Convolutional neural networks (CNNs) and Vision Transformers (ViTs) have achieved excellent performance in image restoration.
1 code implementation • 17 May 2024 • I-Hsiang Chen, Wei-Ting Chen, Yu-Wei Liu, Ming-Hsuan Yang, Sy-Yen Kuo
To address this issue, we introduce an effective approach to stabilize the proposal-target matching in point-based methods.
Ranked #1 on Crowd Counting on NWPU-Crowd
no code implementations • CVPR 2024 • Kelvin C. K. Chan, Yang Zhao, Xuhui Jia, Ming-Hsuan Yang, Huisheng Wang
In subject-driven text-to-image synthesis, the synthesis process tends to be heavily influenced by the reference images provided by users, often overlooking crucial attributes detailed in the text prompt.
1 code implementation • CVPR 2024 • Chengxu Liu, Xuan Wang, Xiangyu Xu, Ruhao Tian, Shuai Li, Xueming Qian, Ming-Hsuan Yang
In particular, we use a motion estimation network to capture motion information from neighborhoods, thereby adaptively estimating spatially-variant motion flow, mask, kernels, weights, and offsets to obtain the MISC Filter.
no code implementations • 17 Apr 2024 • Hao-Wei Chen, Yu-Syuan Xu, Kelvin C. K. Chan, Hsien-Kai Kuo, Chun-Yi Lee, Ming-Hsuan Yang
Towards this goal, we propose AdaIR, a novel framework that enables low storage cost and efficient training without sacrificing performance.
no code implementations • 15 Apr 2024 • Chieh Hubert Lin, Changil Kim, Jia-Bin Huang, Qinbo Li, Chih-Yao Ma, Johannes Kopf, Ming-Hsuan Yang, Hung-Yu Tseng
These two problems are further exacerbated by the use of pixel-distance losses.
1 code implementation • 15 Apr 2024 • Yu-Ju Tsai, Jin-Cheng Jhang, Jingjing Zheng, Wei Wang, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo, Ming-Hsuan Yang
A unique property of our Bi-Layout model is its ability to inherently detect ambiguous regions by comparing the two predictions.
no code implementations • 11 Apr 2024 • Weijie Lyu, Xueting Li, Abhijit Kundu, Yi-Hsuan Tsai, Ming-Hsuan Yang
We introduce Gaga, a framework that reconstructs and segments open-world 3D scenes by leveraging inconsistent 2D masks predicted by zero-shot segmentation models.
no code implementations • 9 Apr 2024 • Deshui Miao, Xin Li, Zhenyu He, Huchuan Lu, Ming-Hsuan Yang
In addition, we propose a spatial-temporal memory to assist feature association and temporal ID assignment and correlation.
no code implementations • 9 Apr 2024 • Pin-Hung Kuo, Jinshan Pan, Shao-Yi Chien, Ming-Hsuan Yang
By retaining partial information in additional dimensions independent from the self-attention calculations, our method effectively captures global contextual representations with complexity linear in the image size.
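For context, a generic kernelized linear-attention sketch is shown below; it conveys how attention cost can be made linear in the number of tokens, but it is not this paper's exact formulation.

```python
# Generic linear attention with O(N) cost in sequence length N, sketching
# the idea behind complexity linear in the image size. Not the paper's
# exact method; phi(x) = elu(x) + 1 is one common choice of feature map.
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Q, K, V: (N, d) arrays."""
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                    # (d, d): aggregate keys/values once
    z = Qf @ Kf.sum(axis=0)          # (N,): per-query normalization
    return (Qf @ kv) / (z[:, None] + eps)

N, d = 4096, 64                      # e.g., N image tokens of dimension d
out = linear_attention(*np.random.randn(3, N, d))
print(out.shape)                     # (4096, 64); no N x N matrix is formed
```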
1 code implementation • 3 Apr 2024 • Zhongyu Xia, Zhiwei Lin, Xinhao Wang, Yongtao Wang, Yun Xing, Shengxiang Qi, Nan Dong, Ming-Hsuan Yang
Three-dimensional perception from multi-view cameras is a crucial component in autonomous driving systems, which involves multiple tasks like 3D object detection and bird's-eye-view (BEV) semantic segmentation.
1 code implementation • 2 Apr 2024 • Akshay Dudhane, Omkar Thawakar, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, Ming-Hsuan Yang
All-in-one image restoration tackles different types of degradations with a unified model instead of having task-specific, non-generic models for each degradation.
1 code implementation • CVPR 2024 • Yuqing Huang, Xin Li, Zikun Zhou, YaoWei Wang, Zhenyu He, Ming-Hsuan Yang
Upon the PN tree memory, we develop corresponding walking rules for determining the state of the target and define a set of control flows to unite the tracker and the detector in different tracking scenarios.
Ranked #1 on Visual Object Tracking on VideoCube
1 code implementation • 26 Mar 2024 • Abdelrahman Shaker, Syed Talal Wasim, Martin Danelljan, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
Recently, transformer-based approaches have shown promising results for semi-supervised video object segmentation.
1 code implementation • CVPR 2024 • Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Ren, Ming-Hsuan Yang, Sergey Tulyakov
Next, we finetune a retrieval model on a small subset where the best caption of each video is manually selected, and then apply the model to the whole dataset to select the best caption as the annotation.
no code implementations • 27 Feb 2024 • Hankyul Kang, Ming-Hsuan Yang, Jongbin Ryu
In this work, we propose an effective method to decompose the attention operation into query- and key-less components.
no code implementations • 21 Feb 2024 • Zhengxue Wang, Zhiqiang Yan, Ming-Hsuan Yang, Jinshan Pan, Guangwei Gao, Ying Tai, Jian Yang
Specifically, we design an All-in-one Prior Propagation that computes the similarity between multi-modal scene priors, i.e., RGB, normal, semantic, and depth, to reduce the texture interference.
no code implementations • 20 Feb 2024 • Long Zhao, Nitesh B. Gundavarapu, Liangzhe Yuan, Hao Zhou, Shen Yan, Jennifer J. Sun, Luke Friedman, Rui Qian, Tobias Weyand, Yue Zhao, Rachel Hornung, Florian Schroff, Ming-Hsuan Yang, David A. Ross, Huisheng Wang, Hartwig Adam, Mikhail Sirotenko, Ting Liu, Boqing Gong
We introduce VideoPrism, a general-purpose video encoder that tackles diverse video understanding tasks with a single frozen model.
1 code implementation • 20 Feb 2024 • Gaoxiang Cong, Yuankai Qi, Liang Li, Amin Beheshti, Zhedong Zhang, Anton Van Den Hengel, Ming-Hsuan Yang, Chenggang Yan, Qingming Huang
Given a script, the challenge in Movie Dubbing (Visual Voice Cloning, V2C) is to generate speech that aligns well with the video in both time and emotion, based on the tone of a reference audio track.
no code implementations • 16 Feb 2024 • Divin Yan, Lu Qi, Vincent Tao Hu, Ming-Hsuan Yang, Meng Tang
To address the observed appearance overlap between synthesized images of rare classes and tail classes, we propose a method based on contrastive learning to minimize the overlap between distributions of synthetic images for different classes.
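A minimal sketch of the general ingredient, a supervised-contrastive-style loss that separates embeddings of different classes, is given below; the temperature, shapes, and function name are assumptions, not the paper's implementation.

```python
# Sketch: pull together embeddings of synthetic images from the same class
# and push apart those of different classes, reducing distribution overlap.
import torch
import torch.nn.functional as F

def class_contrastive_loss(embeddings, labels, tau=0.1):
    """embeddings: (B, D); labels: (B,) integer class ids."""
    z = F.normalize(embeddings, dim=1)
    sim = z @ z.T / tau                                   # (B, B)
    mask = torch.eye(len(labels), dtype=torch.bool)
    sim = sim.masked_fill(mask, -1e9)                     # exclude self-pairs
    pos = (labels[:, None] == labels[None, :]).float().masked_fill(mask, 0)
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    return -(log_prob * pos).sum(1).div(pos.sum(1).clamp(min=1)).mean()

loss = class_contrastive_loss(torch.randn(8, 128), torch.randint(0, 3, (8,)))
```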
no code implementations • 11 Feb 2024 • Xiaoyu Zhou, Xingjian Ran, Yajiao Xiong, Jinlin He, Zhiwei Lin, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang
We present GALA3D, generative 3D GAussians with LAyout-guided control, for effective compositional text-to-3D generation.
1 code implementation • 4 Feb 2024 • Tao Wang, Wanglong Lu, Kaihao Zhang, Wenhan Luo, Tae-Kyun Kim, Tong Lu, Hongdong Li, Ming-Hsuan Yang
For the prompt generation, we first propose a prompt pre-training strategy to train a frequency prompt encoder that encodes the ground-truth image into LF and HF prompts.
no code implementations • 4 Feb 2024 • Lu Qi, Yi-Wen Chen, Lehan Yang, Tiancheng Shen, Xiangtai Li, Weidong Guo, Yu Xu, Ming-Hsuan Yang
In this work, we propose a novel approach to densely ground visual entities from a long caption.
1 code implementation • 18 Jan 2024 • Shilin Xu, Haobo Yuan, Qingyu Shi, Lu Qi, Jingbo Wang, Yibo Yang, Yining Li, Kai Chen, Yunhai Tong, Bernard Ghanem, Xiangtai Li, Ming-Hsuan Yang
Segment Anything Model (SAM) is one remarkable model that can achieve generalized segmentation.
no code implementations • CVPR 2024 • Yu-Ju Tsai, Jin-Cheng Jhang, Jingjing Zheng, Wei Wang, Albert Y. C. Chen, Min Sun, Cheng-Hao Kuo, Ming-Hsuan Yang
Specifically, on the MatterportLayout dataset, it improves 3DIoU from 81.70% to 82.57% across the full test set and, notably, from 54.80% to 59.97% in subsets with significant ambiguity.
no code implementations • CVPR 2024 • Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
Our contributions include a novel spatio-temporal video grounding model surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios.
1 code implementation • CVPR 2024 • Xinyan Liu, Guorong Li, Yuankai Qi, Ziheng Yan, Zhenjun Han, Anton Van Den Hengel, Ming-Hsuan Yang, Qingming Huang
To provide a more realistic reflection of the underlying practical challenge, we introduce a weakly supervised VIC task wherein trajectory labels are not provided.
no code implementations • 31 Dec 2023 • Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
Our contributions include a novel spatio-temporal video grounding model, surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios.
no code implementations • 21 Dec 2023 • Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, Krishna Somandepalli, Hassan Akbari, Yair Alon, Yong Cheng, Josh Dillon, Agrim Gupta, Meera Hahn, Anja Hauth, David Hendon, Alonso Martinez, David Minnen, Mikhail Sirotenko, Kihyuk Sohn, Xuan Yang, Hartwig Adam, Ming-Hsuan Yang, Irfan Essa, Huisheng Wang, David A. Ross, Bryan Seybold, Lu Jiang
We present VideoPoet, a language model capable of synthesizing high-quality video, with matching audio, from a large variety of conditioning signals.
Ranked #4 on Text-to-Video Generation on MSR-VTT
1 code implementation • CVPR 2024 • Xirui Li, Chao Ma, Xiaokang Yang, Ming-Hsuan Yang
In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames.
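A sketch of the cross-frame merging idea follows; the matching rule and threshold are illustrative assumptions, and the paper's actual merging scheme may differ.

```python
# Sketch: tokens of the current frame that closely match tokens of the
# previous frame are replaced by their average, tying appearance across
# frames and improving temporal consistency.
import torch
import torch.nn.functional as F

def merge_tokens_across_frames(prev, curr, threshold=0.9):
    """prev, curr: (N, D) self-attention tokens of consecutive frames."""
    sim = F.normalize(curr, dim=1) @ F.normalize(prev, dim=1).T  # (N, N)
    best_sim, best_idx = sim.max(dim=1)          # nearest previous token
    merged = curr.clone()
    keep = best_sim > threshold
    merged[keep] = 0.5 * (curr[keep] + prev[best_idx[keep]])
    return merged

prev, curr = torch.randn(196, 64), torch.randn(196, 64)
out = merge_tokens_across_frames(prev, curr)     # same shape as curr
```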
1 code implementation • CVPR 2024 • Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang
We present DrivingGaussian, an efficient and effective framework for surrounding dynamic autonomous driving scenes.
1 code implementation • CVPR 2024 • Kuan-Chih Huang, Weijie Lyu, Ming-Hsuan Yang, Yi-Hsuan Tsai
Recent temporal LiDAR-based 3D object detectors achieve promising performance based on the two-stage proposal-based approach.
1 code implementation • 12 Dec 2023 • Jiangning Zhang, Xuhai Chen, Yabiao Wang, Chengjie Wang, Yong Liu, Xiangtai Li, Ming-Hsuan Yang, DaCheng Tao
E.g., it achieves 85.4 mAD, surpassing UniAD by +3.0 on the MVTec AD dataset, and requires only 1.1 hours and 2.3 GB of GPU memory to complete model training on a single V100, serving as a strong baseline to facilitate future research.
1 code implementation • 12 Dec 2023 • Kuan-Chih Huang, Yi-Hsuan Tsai, Ming-Hsuan Yang
Finally, the training-level constraint is utilized by producing accurate and consistent 3D pseudo-labels that align with the visual data.
1 code implementation • 10 Dec 2023 • Xinyan Liu, Guorong Li, Yuankai Qi, Ziheng Yan, Zhenjun Han, Anton Van Den Hengel, Ming-Hsuan Yang, Qingming Huang
To provide a more realistic reflection of the underlying practical challenge, we introduce a weakly supervised VIC task, wherein trajectory labels are not provided.
no code implementations • 9 Dec 2023 • Hao Zhang, Fang Li, Lu Qi, Ming-Hsuan Yang, Narendra Ahuja
Addressing Out-Of-Distribution (OOD) Segmentation and Zero-Shot Semantic Segmentation (ZS3) is challenging, necessitating segmenting unseen classes.
1 code implementation • 7 Dec 2023 • Tiantian Wang, Xinxin Zuo, Fangzhou Mu, Jian Wang, Ming-Hsuan Yang
To overcome these limitations, we leverage Neural Radiance Fields (NeRFs) to represent videos, conducting stylization in the rendered feature space.
no code implementations • 5 Dec 2023 • Tao Tu, Ming-Feng Li, Chieh Hubert Lin, Yen-Chi Cheng, Min Sun, Ming-Hsuan Yang
In this work, we study articulated 3D shape reconstruction from a single and casually captured internet video, where the subject's view coverage is incomplete.
1 code implementation • NeurIPS 2023 • Cheng-Ju Ho, Chen-Hsuan Tai, Yen-Yu Lin, Ming-Hsuan Yang, Yi-Hsuan Tsai
Semi-supervised object detection is crucial for 3D scene understanding, efficiently addressing the limitation of acquiring large-scale 3D bounding box annotations.
no code implementations • 5 Dec 2023 • Hsin-Ping Huang, Yu-Chuan Su, Deqing Sun, Lu Jiang, Xuhui Jia, Yukun Zhu, Ming-Hsuan Yang
To achieve detailed control, we propose a unified framework to jointly inject control signals into the existing text-to-video model.
no code implementations • 4 Dec 2023 • Xin Lin, Jingtong Yue, Kelvin C. K. Chan, Lu Qi, Chao Ren, Jinshan Pan, Ming-Hsuan Yang
To guide the restoration model with the features of DINOv2, we develop a DINO-Restore adaption and fusion module to adjust the channel of fused features from PSF and then integrate them with the features from the restoration model.
1 code implementation • CVPR 2024 • Lu Qi, Lehan Yang, Weidong Guo, Yu Xu, Bo Du, Varun Jampani, Ming-Hsuan Yang
On the other hand, the progressive dichotomy module can efficiently decode the synthesized colormap into high-quality entity-level masks via a depth-first binary search, without knowing the number of clusters in advance.
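One way to read the depth-first decoding is sketched below: recursively split pixels by color until each region is color-uniform, discovering the entity count on the fly. The uniformity test and split rule here are assumptions for illustration, not the module's actual design.

```python
# Sketch: decode a colormap into entity masks by depth-first binary
# splitting, with no prior knowledge of how many entities exist.
import numpy as np

def decode_colormap(colors, idx=None, tol=1e-3):
    """colors: (H*W, 3) flattened colormap; yields boolean pixel masks."""
    idx = np.arange(len(colors)) if idx is None else idx
    region = colors[idx]
    if region.std(axis=0).max() < tol:           # color-uniform: one entity
        mask = np.zeros(len(colors), dtype=bool)
        mask[idx] = True
        yield mask
        return
    c = region.std(axis=0).argmax()              # most varied color channel
    thr = np.median(region[:, c])                # binary split at the median
    left, right = idx[region[:, c] <= thr], idx[region[:, c] > thr]
    if len(left) == 0 or len(right) == 0:        # degenerate split: stop
        mask = np.zeros(len(colors), dtype=bool)
        mask[idx] = True
        yield mask
        return
    yield from decode_colormap(colors, left, tol)    # depth-first recursion
    yield from decode_colormap(colors, right, tol)

cmap = np.repeat(np.random.rand(4, 3), 100, axis=0)  # 4 entities, 100 px each
print(len(list(decode_colormap(cmap))))              # 4
```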
no code implementations • 4 Dec 2023 • Yunhao Liu, Yu-Ju Tsai, Kelvin C. K. Chan, Xiangtai Li, Lu Qi, Ming-Hsuan Yang
Traditional heuristic approaches, which either train models directly on these degraded images or on their enhanced counterparts produced by face restoration techniques, have proven ineffective, primarily due to the degradation of facial features and the discrepancy in image domains.
1 code implementation • 4 Dec 2023 • Chen Zhang, Guorong Li, Yuankai Qi, Hanhua Ye, Laiyun Qing, Ming-Hsuan Yang, Qingming Huang
To address these limitations, we propose a Dynamic Erasing Network (DE-Net) for weakly supervised video anomaly detection, which learns multi-scale temporal features.
2 code implementations • CVPR 2024 • Hung-Yu Tseng, Hsin-Ying Lee, Ming-Hsuan Yang
Contents generated by recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate due to the immitigable domain gap.
1 code implementation • CVPR 2024 • Yuanze Lin, Yi-Wen Chen, Yi-Hsuan Tsai, Lu Jiang, Ming-Hsuan Yang
Language has emerged as a natural interface for image editing.
1 code implementation • CVPR 2024 • Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, Ming-Hsuan Yang
This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing.
Ranked #1 on Semantic correspondence on SPair-71k
2 code implementations • 20 Nov 2023 • Yuheng Liu, Xinke Li, Xueting Li, Lu Qi, Chongshou Li, Ming-Hsuan Yang
We introduce a framework, the Pyramid Discrete Diffusion model (PDD), which employs scale-varied diffusion models to progressively generate high-quality outdoor scenes.
1 code implementation • 6 Nov 2023 • Hao Zhou, Tiancheng Shen, Xu Yang, Hai Huang, Xiangtai Li, Lu Qi, Ming-Hsuan Yang
We benchmark the proposed evaluation metrics on 12 open-vocabulary methods across three segmentation tasks.
1 code implementation • CVPR 2024 • Hanoona Rasheed, Muhammad Maaz, Sahal Shaji Mullappilly, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, Fahad S. Khan
In this work, we present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
no code implementations • 22 Oct 2023 • Yong Du, Jiahui Zhan, Shengfeng He, Xinzhe Li, Junyu Dong, Sheng Chen, Ming-Hsuan Yang
In this paper, we propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains under conditions of limited training data and significant visual differences.
2 code implementations • 9 Oct 2023 • Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation.
Ranked #2 on Video Prediction on Kinetics-600 12 frames, 64x64
1 code implementation • NeurIPS 2023 • Meng Liu, Mingda Zhang, Jialu Liu, Hanjun Dai, Ming-Hsuan Yang, Shuiwang Ji, Zheyun Feng, Boqing Gong
In this paper, we present a novel problem, namely video timeline modeling.
no code implementations • ICCV 2023 • Xiaoyu Zhou, Zhiwei Lin, Xiaojun Shan, Yongtao Wang, Deqing Sun, Ming-Hsuan Yang
Recent novel view synthesis methods obtain promising results for relatively small scenes, e.g., indoor environments and scenes with a few objects, but tend to fail for unbounded outdoor scenes with a single image as input.
no code implementations • 10 Sep 2023 • Shuangkang Fang, Yufeng Wang, Yi Yang, Yi-Hsuan Tsai, Wenrui Ding, Shuchang Zhou, Ming-Hsuan Yang
To tackle these issues, we introduce a text-driven editing method, termed DN2N, which allows for the direct acquisition of a NeRF model with universal editing capabilities, eliminating the requirement for retraining.
1 code implementation • ICCV 2023 • Xin Li, Yuqing Huang, Zhenyu He, YaoWei Wang, Huchuan Lu, Ming-Hsuan Yang
Existing visual tracking methods typically take an image patch as the reference of the target to perform tracking.
1 code implementation • ICCV 2023 • Kuan-Chih Huang, Ming-Hsuan Yang, Yi-Hsuan Tsai
In this paper, we find that the motion cue of objects along different time frames is critical in 3D multi-object tracking, which is less explored in existing monocular-based approaches.
1 code implementation • 14 Aug 2023 • Yu-Ju Tsai, Yu-Lun Liu, Lu Qi, Kelvin C. K. Chan, Ming-Hsuan Yang
Restoring facial details from low-quality (LQ) images has remained a challenging problem due to its ill-posedness induced by various degradations in the wild.
Ranked #2 on Blind Face Restoration on WIDER
1 code implementation • 25 Jul 2023 • Muhammad Awais, Muzammal Naseer, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal, Mubarak Shah, Ming-Hsuan Yang, Fahad Shahbaz Khan
Vision systems to see and reason about the compositional nature of visual scenes are fundamental to understanding our world.
1 code implementation • ICCV 2023 • Yunhao Ge, Yuecheng Li, Shuo Ni, Jiaping Zhao, Ming-Hsuan Yang, Laurent Itti
Reprogramming parameters are task-specific and exclusive to each task, which makes our method immune to catastrophic forgetting.
2 code implementations • ICCV 2023 • Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and the textual diversity.
Ranked #2 on Prompt Engineering on ImageNet V2
1 code implementation • 6 Jul 2023 • Liangzhe Yuan, Nitesh Bharadwaj Gundavarapu, Long Zhao, Hao Zhou, Yin Cui, Lu Jiang, Xuan Yang, Menglin Jia, Tobias Weyand, Luke Friedman, Mikhail Sirotenko, Huisheng Wang, Florian Schroff, Hartwig Adam, Ming-Hsuan Yang, Ting Liu, Boqing Gong
We evaluate the video understanding capabilities of existing foundation models (FMs) using a carefully designed experiment protocol consisting of three hallmark tasks (action recognition, temporal localization, and spatiotemporal localization), eight datasets well received by the community, and four adaptation methods tailoring an FM for downstream tasks.
no code implementations • NeurIPS 2023 • Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.
no code implementations • ICCV 2023 • Zhi-Kai Huang, Wei-Ting Chen, Yuan-Chun Chiang, Sy-Yen Kuo, Ming-Hsuan Yang
Crowd counting has recently attracted significant attention in the field of computer vision due to its wide applications to image understanding.
1 code implementation • 28 May 2023 • Lu Qi, Jason Kuen, Weidong Guo, Jiuxiang Gu, Zhe Lin, Bo Du, Yu Xu, Ming-Hsuan Yang
Despite the progress of image segmentation for accurate visual entity segmentation, completing the diverse requirements of image editing applications for different-level region-of-interest selections remains unsolved.
1 code implementation • NeurIPS 2023 • Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, Ming-Hsuan Yang
Text-to-image diffusion models have made significant advances in generating and editing high-quality images.
Ranked #1 on Dense Pixel Correspondence Estimation on TSS
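The general recipe behind such correspondence results can be sketched in a few lines: embed both images with a feature extractor and match pixels by cosine similarity. The extractor below is a stand-in; the paper's choice of diffusion layer, timestep, and post-processing differ.

```python
# Sketch: dense nearest-neighbor matching on deep feature maps. The feature
# maps would come from a diffusion model's intermediate activations; here
# random tensors stand in for them.
import torch
import torch.nn.functional as F

def dense_correspondence(feat_a, feat_b):
    """feat_a, feat_b: (C, H, W); returns (H*W, 2) matches in image b."""
    C, H, W = feat_a.shape
    a = F.normalize(feat_a.reshape(C, -1), dim=0)    # unit-norm per pixel
    b = F.normalize(feat_b.reshape(C, -1), dim=0)
    sim = a.T @ b                                    # (HW, HW) cosine sim
    match = sim.argmax(dim=1)                        # best pixel in b
    rows = torch.div(match, W, rounding_mode="floor")
    return torch.stack((rows, match % W), dim=1)     # (row, col) per pixel

fa, fb = torch.randn(256, 32, 32), torch.randn(256, 32, 32)
coords = dense_correspondence(fa, fb)                # (1024, 2)
```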
1 code implementation • 6 May 2023 • Xin Lin, Jingtong Yue, Sixian Ding, Chao Ren, Lu Qi, Ming-Hsuan Yang
Our model features two main components: a Dual Degradation Representation Network (DDR-Net) and a Restoration Network.
no code implementations • 27 Apr 2023 • Tsai-Shien Chen, Chieh Hubert Lin, Hung-Yu Tseng, Tsung-Yi Lin, Ming-Hsuan Yang
In response to this gap, we introduce MCDiff, a conditional diffusion model that generates a video from a starting image frame and a set of strokes, which allow users to specify the intended content and dynamics for synthesis.
no code implementations • 15 Apr 2023 • Hsin-Ping Huang, Yu-Chuan Su, Ming-Hsuan Yang
We tackle the long video generation problem, i.e., generating videos beyond the output length of video generation models.
2 code implementations • CVPR 2023 • Akshay Dudhane, Syed Waqas Zamir, Salman Khan, Fahad Shahbaz Khan, Ming-Hsuan Yang
Unlike existing methods, the proposed alignment module not only aligns burst features but also exchanges feature information and maintains focused communication with the reference frame through the proposed reference-based feature enrichment mechanism, which facilitates handling complex motions.
1 code implementation • ICCV 2023 • Amandeep Kumar, Ankan Kumar Bhunia, Sanath Narayan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
We present a method to efficiently generate 3D-aware high-resolution images that are view-consistent across multiple target views.
no code implementations • 28 Mar 2023 • Yuanhao Xiong, Long Zhao, Boqing Gong, Ming-Hsuan Yang, Florian Schroff, Ting Liu, Cho-Jui Hsieh, Liangzhe Yuan
Existing video-language pre-training methods primarily focus on instance-level alignment between video clips and captions via global contrastive learning but neglect rich fine-grained local information in both videos and text, which is of importance to downstream tasks requiring temporal localization and semantic reasoning.
5 code implementations • ICCV 2023 • Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed.
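A simplified sketch of the efficient additive attention idea is below: token scores form one global query that interacts with keys elementwise, so no N x N attention matrix is ever built. This is a reading of the mechanism, not the official SwiftFormer code.

```python
# Sketch of additive attention with linear cost in token count N: a learned
# scoring vector produces per-token weights, the weighted sum of queries is
# a single global query, and keys interact with it elementwise.
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.w = nn.Parameter(torch.randn(dim))       # learned scoring vector
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):                             # x: (B, N, dim)
        q, k = self.to_q(x), self.to_k(x)
        alpha = torch.softmax(q @ self.w * self.scale, dim=1)        # (B, N)
        q_global = (alpha.unsqueeze(-1) * q).sum(dim=1, keepdim=True)
        return self.proj(q_global * k) + q            # elementwise, O(N*dim)

attn = AdditiveAttention(64)
print(attn(torch.randn(2, 196, 64)).shape)            # torch.Size([2, 196, 64])
```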
1 code implementation • ICCV 2023 • Long Zhao, Liangzhe Yuan, Boqing Gong, Yin Cui, Florian Schroff, Ming-Hsuan Yang, Hartwig Adam, Ting Liu
To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs).
no code implementations • ICCV 2023 • Chieh Hubert Lin, Hsin-Ying Lee, Willi Menapace, Menglei Chai, Aliaksandr Siarohin, Ming-Hsuan Yang, Sergey Tulyakov
Toward infinite-scale 3D city synthesis, we propose a novel framework, InfiniCity, which constructs and renders an unconstrainedly large and 3D-grounded environment from random noise.
1 code implementation • 3 Jan 2023 • Yue Han, Jiangning Zhang, Yabiao Wang, Chengjie Wang, Yong Liu, Lu Qi, Xiangtai Li, Ming-Hsuan Yang
Few-Shot Instance Segmentation (FSIS) requires detecting and segmenting novel classes with limited support examples.
1 code implementation • 3 Jan 2023 • Xiangtai Li, Shilin Xu, Yibo Yang, Haobo Yuan, Guangliang Cheng, Yunhai Tong, Zhouchen Lin, Ming-Hsuan Yang, DaCheng Tao
Third, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme to boost part segmentation qualities further.
5 code implementations • 2 Jan 2023 • Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan
Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient due to the use of discrete tokens and requiring fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient due to the use of parallel decoding.
Ranked #1 on Text-to-Image Generation on MS-COCO (FID metric)
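The parallel-decoding efficiency claim follows the MaskGIT-style loop sketched below: predict every masked token at once, keep the most confident predictions, and re-mask the rest. The `model` stand-in, schedule, and sizes are assumptions, not Muse's actual configuration.

```python
# Sketch of iterative parallel decoding over discrete tokens: far fewer
# forward passes than autoregressive decoding (steps vs. seq_len).
import torch

def parallel_decode(model, seq_len, steps=8, mask_id=-1):
    tokens = torch.full((seq_len,), mask_id)
    for step in range(steps):
        logits = model(tokens)                    # (seq_len, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        masked = tokens == mask_id
        tokens[masked] = pred[masked]             # fill every masked slot
        if step < steps - 1:                      # re-mask the least confident
            n_remask = int(seq_len * (1 - (step + 1) / steps))
            conf[~masked] = float("inf")          # never re-mask kept tokens
            tokens[conf.topk(n_remask, largest=False).indices] = mask_id
    return tokens

toy_model = lambda t: torch.randn(t.numel(), 1024)  # hypothetical stand-in
out = parallel_decode(toy_model, seq_len=256)       # 8 passes, 256 tokens
```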
1 code implementation • ICCV 2023 • Joungbin An, Hyolim Kang, Su Ho Han, Ming-Hsuan Yang, Seon Joo Kim
Online Action Detection (OAD) is the task of identifying actions in streaming videos without access to future frames.
Ranked #1 on Online Action Detection on TVSeries
1 code implementation • CVPR 2023 • Botao Ye, Sifei Liu, Xueting Li, Ming-Hsuan Yang
In this work, we introduce a self-supervised super-plane constraint by exploring the free geometry cues from the predicted surface, which can further regularize the reconstruction of plane regions without any other ground truth annotations.
no code implementations • ICCV 2023 • Lu Qi, Jason Kuen, Tiancheng Shen, Jiuxiang Gu, Wenbo Li, Weidong Guo, Jiaya Jia, Zhe Lin, Ming-Hsuan Yang
Given the high-quality and high-resolution nature of the dataset, we propose CropFormer, which is designed to tackle the intractability of instance-level segmentation on high-resolution images.
1 code implementation • 22 Dec 2022 • Christoph Mayer, Martin Danelljan, Ming-Hsuan Yang, Vittorio Ferrari, Luc van Gool, Alina Kuznetsova
Our approach achieves a 4x faster run-time in the case of 10 concurrent objects compared to tracking each object independently, and outperforms existing single-object trackers on our new benchmark.
1 code implementation • CVPR 2023 • Chun-Han Yao, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani
Automatically estimating 3D skeleton, shape, camera viewpoints, and part articulation from sparse in-the-wild image ensembles is a severely under-constrained and challenging problem.
1 code implementation • 19 Dec 2022 • Cheng-Ju Ho, Chen-Hsuan Tai, Yi-Hsuan Tsai, Yen-Yu Lin, Ming-Hsuan Yang
In this work, we propose an object-level point augmentor (OPA) that performs local transformations for semi-supervised 3D object detection.
1 code implementation • 12 Dec 2022 • Zhiwei Lin, Yongtao Wang, Shengxiang Qi, Nan Dong, Ming-Hsuan Yang
Based on the property of outdoor point clouds in autonomous driving scenarios, i.e., that the point clouds of distant objects are sparser, we propose point density prediction to enable the 3D encoder to learn location information, which is essential for object detection.
1 code implementation • CVPR 2023 • Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
Ranked #1 on Video Prediction on Something-Something V2
no code implementations • 9 Dec 2022 • Youming Deng, Xueting Li, Sifei Liu, Ming-Hsuan Yang
We present a physics-based inverse rendering method that learns the illumination, geometry, and materials of a scene from posed multi-view RGB images.
2 code implementations • 8 Dec 2022 • Abdelrahman Shaker, Muhammad Maaz, Hanoona Rasheed, Salman Khan, Ming-Hsuan Yang, Fahad Shahbaz Khan
Owing to the success of transformer models, recent works study their applicability in 3D medical segmentation tasks.
1 code implementation • 8 Dec 2022 • Ziheng Yan, Yuankai Qi, Guorong Li, Xinyan Liu, Weigang Zhang, Qingming Huang, Ming-Hsuan Yang
Crowd counting is usually handled in a density map regression fashion, which is supervised via an L2 loss between the predicted density map and the ground truth.
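For reference, the standard density-map setup the sentence describes fits in a few lines; the shapes here are arbitrary examples.

```python
# Sketch of density map regression for crowd counting: the network output is
# supervised with an L2 (MSE) loss against a Gaussian-blurred annotation map,
# and the predicted count is simply the map's sum.
import torch
import torch.nn.functional as F

pred_density = torch.rand(1, 1, 96, 128)   # hypothetical network output
gt_density = torch.rand(1, 1, 96, 128)     # blurred head-point annotations

loss = F.mse_loss(pred_density, gt_density, reduction="sum")  # L2 supervision
count = pred_density.sum()                 # predicted crowd count
print(loss.item(), count.item())
```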
1 code implementation • CVPR 2023 • Gaoxiang Cong, Liang Li, Yuankai Qi, ZhengJun Zha, Qi Wu, Wenyu Wang, Bin Jiang, Ming-Hsuan Yang, Qingming Huang
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as visual voice cloning, V2C) aims to generate speech that matches the speaker's emotion in the video, using the desired speaker's voice as reference.
no code implementations • CVPR 2023 • Chen Zhang, Guorong Li, Yuankai Qi, Shuhui Wang, Laiyun Qing, Qingming Huang, Ming-Hsuan Yang
Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels.
no code implementations • 8 Dec 2022 • Xinyan Liu, Guorong Li, Yuankai Qi, Zhenjun Han, Qingming Huang, Ming-Hsuan Yang, Nicu Sebe
Crowd localization aims to predict the spatial position of humans in a crowd scenario.
no code implementations • CVPR 2023 • Hsin-Ping Huang, Charles Herrmann, Junhwa Hur, Erika Lu, Kyle Sargent, Austin Stone, Ming-Hsuan Yang, Deqing Sun
Recently, AutoFlow has shown promising results on learning a training set for optical flow, but requires ground truth labels in the target domain to compute its search metric.
1 code implementation • CVPR 2023 • Yunhao Ge, Jie Ren, Andrew Gallagher, Yuxiao Wang, Ming-Hsuan Yang, Hartwig Adam, Laurent Itti, Balaji Lakshminarayanan, Jiaping Zhao
We also show that our method improves across ImageNet shifted datasets, four other datasets, and other model architectures such as LiT.
no code implementations • 29 Nov 2022 • Taihong Xiao, ZiRui Wang, Liangliang Cao, Jiahui Yu, Shengyang Dai, Ming-Hsuan Yang
Vision-language foundation models pretrained on large-scale data provide a powerful tool for many visual understanding tasks.
1 code implementation • 21 Nov 2022 • Ling Yang, Zhilin Huang, Yang Song, Shenda Hong, Guohao Li, Wentao Zhang, Bin Cui, Bernard Ghanem, Ming-Hsuan Yang
Generating images from graph-structured inputs, such as scene graphs, is uniquely challenging due to the difficulty of aligning nodes and connections in graphs with objects and their relations in images.
1 code implementation • 10 Nov 2022 • Lu Qi, Jason Kuen, Weidong Guo, Tiancheng Shen, Jiuxiang Gu, Jiaya Jia, Zhe Lin, Ming-Hsuan Yang
It improves mask prediction by fusing the full image with high-resolution image crops that provide more fine-grained image details.
no code implementations • 27 Oct 2022 • Jie Cao, Mandi Luo, Junchi Yu, Ming-Hsuan Yang, Ran He
Then, we optimize the augmented samples by minimizing the norms of the data scores, i.e., the gradients of the log-density functions.
no code implementations • 23 Oct 2022 • Yunfan Liu, Qi Li, Qiyao Deng, Zhenan Sun, Ming-Hsuan Yang
Facial Attribute Manipulation (FAM) aims to aesthetically modify a given face image to render desired attributes, which has received significant attention due to its broad practical applications ranging from digital entertainment to biometric forensics.
2 code implementations • 2 Sep 2022 • Ling Yang, Zhilong Zhang, Yang Song, Shenda Hong, Runsheng Xu, Yue Zhao, Yingxia Shao, Wentao Zhang, Bin Cui, Ming-Hsuan Yang
This survey aims to provide a contextualized, in-depth look at the state of diffusion models, identifying the key areas of focus and pointing to potential areas for further exploration.
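As a pointer to the mechanics such a survey covers, a minimal DDPM-style forward process and training target look like this; the schedule values are illustrative defaults, not prescriptions from the survey.

```python
# Sketch of the diffusion forward process: x_t = sqrt(a_bar_t) * x_0
# + sqrt(1 - a_bar_t) * eps, with training regressing the injected noise.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)           # linear noise schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def q_sample(x0, t, noise):
    a = alphas_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x0 + (1 - a).sqrt() * noise

x0 = torch.randn(4, 3, 32, 32)                  # a batch of "images"
t = torch.randint(0, T, (4,))
eps = torch.randn_like(x0)
x_t = q_sample(x0, t, eps)                      # noised inputs for training
# Objective: minimize ||eps - eps_theta(x_t, t)||^2 for a noise-prediction net.
```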
1 code implementation • 23 Aug 2022 • Chun-Han Yao, Jimei Yang, Duygu Ceylan, Yi Zhou, Yang Zhou, Ming-Hsuan Yang
An alternative approach is to estimate dense vertices of a predefined template body in the image space.
1 code implementation • 8 Aug 2022 • Jean Lahoud, Jiale Cao, Fahad Shahbaz Khan, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, Ming-Hsuan Yang
The success of the transformer architecture in natural language processing has recently triggered attention in the computer vision field.
1 code implementation • 1 Aug 2022 • Lu Zhang, Lu Qi, Xu Yang, Hong Qiao, Ming-Hsuan Yang, Zhiyong Liu
In the first stage, we obtain a robust feature extractor, which could serve for all images with base and novel categories.
no code implementations • 15 Jul 2022 • Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui
Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition.
Ranked #1 on Zero-Shot Action Recognition on HMDB51
no code implementations • 7 Jul 2022 • Chun-Han Yao, Wei-Chih Hung, Yuanzhen Li, Michael Rubinstein, Ming-Hsuan Yang, Varun Jampani
In this work, we propose a practical problem setting to estimate 3D pose and shape of animals given only a few (10-30) in-the-wild images of a particular animal species (say, horse).
1 code implementation • 4 Jul 2022 • Zhiwei Lin, TingTing Liang, Taihong Xiao, Yongtao Wang, Zhi Tang, Ming-Hsuan Yang
To address this issue, we propose a neural architecture search method named FlowNAS to automatically find the better encoder architecture for flow estimation task.
no code implementations • 2 Jun 2022 • Chieh Hubert Lin, Hsin-Ying Lee, Hung-Yu Tseng, Maneesh Singh, Ming-Hsuan Yang
Recent studies show that paddings in convolutional neural networks encode absolute position information which can negatively affect the model performance for certain tasks.