no code implementations • 2 Oct 2024 • Mike Ranzinger, Jon Barker, Greg Heinrich, Pavlo Molchanov, Bryan Catanzaro, Andrew Tao
Various visual foundation models have distinct strengths and weaknesses, both of which can be improved through heterogeneous multi-teacher knowledge distillation without labels, termed "agglomerative models."
1 code implementation • 28 Aug 2024 • Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catanzaro, Andrew Tao, Jan Kautz, Zhiding Yu, Guilin Liu
We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies.
no code implementations • 26 Jul 2024 • Boyi Li, Ligeng Zhu, Ran Tian, Shuhan Tan, Yuxiao Chen, Yao Lu, Yin Cui, Sushant Veer, Max Ehrlich, Jonah Philion, Xinshuo Weng, Fuzhao Xue, Andrew Tao, Ming-Yu Liu, Sanja Fidler, Boris Ivanovic, Trevor Darrell, Jitendra Malik, Song Han, Marco Pavone
Finally, we establish a benchmark for video captioning and introduce a leaderboard, aiming to accelerate advancements in video understanding, captioning, and data alignment.
no code implementations • 29 May 2024 • Hanrong Ye, De-An Huang, Yao Lu, Zhiding Yu, Wei Ping, Andrew Tao, Jan Kautz, Song Han, Dan Xu, Pavlo Molchanov, Hongxu Yin
We introduce X-VILA, an omni-modality model designed to extend the capabilities of large language models (LLMs) by incorporating image, video, and audio modalities.
3 code implementations • CVPR 2024 • Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, Song Han
Visual language models (VLMs) rapidly progressed with the recent success of large language models.
Ranked #3 on Zero-Shot Video Question Answer on MSVD-QA
2 code implementations • 9 Jun 2023 • Ali Hatamizadeh, Greg Heinrich, Hongxu Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov
At a high level, global self-attentions enable the efficient cross-window communication at lower costs.
no code implementations • 18 May 2023 • Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro
In this work, to overcome these limitations of generated datasets, we have two main contributions which lead us to achieve state-of-the-art results on challenging objects: 1) A robust multi-stage learning scheme that gradually relies more on the models own predictions when calculating losses, 2) A novel adversarial learning pipeline with online pseudo-ground truth generations to achieve fine details.
no code implementations • ICCV 2023 • Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, Yogesh Balaji
Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy.
Ranked #8 on Text-to-Video Generation on UCF-101
no code implementations • 17 Mar 2022 • Aysegul Dundar, Jun Gao, Andrew Tao, Bryan Catanzaro
The reconstruction is posed as an adaptation problem and is done progressively where in the first stage, we focus on learning accurate geometry, whereas in the second stage, we focus on learning the texture with a generative adversarial network.
no code implementations • 31 Jan 2022 • Max Ehrlich, Jon Barker, Namitha Padmanabhan, Larry Davis, Andrew Tao, Bryan Catanzaro, Abhinav Shrivastava
Video compression is a central feature of the modern internet powering technologies from social media to video conferencing.
3 code implementations • 24 Nov 2021 • John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro
AFNO is based on a principled foundation of operator learning which allows us to frame token mixing as a continuous global convolution without any dependence on the input resolution.
no code implementations • ICLR 2022 • John Guibas, Morteza Mardani, Zongyi Li, Andrew Tao, Anima Anandkumar, Bryan Catanzaro
AFNO is based on a principled foundation of operator learning which allows us to frame token mixing as a continuous global convolution without any dependence on the input resolution.
no code implementations • CVPR 2021 • Anand Bhattad, Aysegul Dundar, Guilin Liu, Andrew Tao, Bryan Catanzaro
We describe a cycle consistency loss that encourages model textures to be aligned, so as to encourage sharing.
1 code implementation • ICCV 2021 • Ning Yu, Guilin Liu, Aysegul Dundar, Andrew Tao, Bryan Catanzaro, Larry Davis, Mario Fritz
Lastly, we study different attention architectures in the discriminator, and propose a reference attention mechanism.
no code implementations • NeurIPS 2020 • Morteza Mardani, Guilin Liu, Aysegul Dundar, Shiqiu Liu, Andrew Tao, Bryan Catanzaro
The conventional CNNs, recently adopted for synthesis, require to train and test on the same set of images and fail to generalize to unseen images.
no code implementations • 14 Jul 2020 • Guilin Liu, Rohan Taori, Ting-Chun Wang, Zhiding Yu, Shiqiu Liu, Fitsum A. Reda, Karan Sapra, Andrew Tao, Bryan Catanzaro
Specifically, we directly treat the whole encoded feature map of the input texture as transposed convolution filters and the features' self-similarity map, which captures the auto-correlation information, as input to the transposed convolution.
8 code implementations • 21 May 2020 • Andrew Tao, Karan Sapra, Bryan Catanzaro
Multi-scale inference is commonly used to improve the results of semantic segmentation.
Ranked #6 on Semantic Segmentation on Cityscapes val (using extra training data)
no code implementations • CVPR 2020 • Aysegul Dundar, Karan Sapra, Guilin Liu, Andrew Tao, Bryan Catanzaro
Conditional image synthesis for generating photorealistic images serves various applications for content editing to content generation.
1 code implementation • 26 Jan 2020 • Aysegul Dundar, Kevin J. Shih, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro
However, the reconstruction task of the entire image forces the model to allocate landmarks to model the background.
no code implementations • 25 Dec 2019 • Rafael Valle, Fitsum Reda, Mohammad Shoeybi, Patrick Legresley, Andrew Tao, Bryan Catanzaro
We propose a novel approach for image segmentation that combines Neural Ordinary Differential Equations (NODEs) and the Level Set method.
6 code implementations • NeurIPS 2019 • Ting-Chun Wang, Ming-Yu Liu, Andrew Tao, Guilin Liu, Jan Kautz, Bryan Catanzaro
To address the limitations, we propose a few-shot vid2vid framework, which learns to synthesize videos of previously unseen subjects or scenes by leveraging few example images of the target at test time.
Ranked #1 on Video-to-Video Synthesis on YouTube Dancing
no code implementations • 6 Sep 2019 • Kevin J. Shih, Aysegul Dundar, Animesh Garg, Robert Pottorf, Andrew Tao, Bryan Catanzaro
Prediction and interpolation for long-range video data involves the complex task of modeling motion trajectories for each visible object, occlusions and dis-occlusions, as well as appearance changes due to viewpoint and lighting.
1 code implementation • ICCV 2019 • Fitsum A. Reda, Deqing Sun, Aysegul Dundar, Mohammad Shoeybi, Guilin Liu, Kevin J. Shih, Andrew Tao, Jan Kautz, Bryan Catanzaro
We further introduce a pseudo supervised loss term that enforces the interpolated frames to be consistent with predictions of a pre-trained interpolation model.
Ranked #1 on Video Frame Interpolation on UCF101 (PSNR (sRGB) metric)
3 code implementations • CVPR 2019 • Ji Zhang, Kevin J. Shih, Ahmed Elgammal, Andrew Tao, Bryan Catanzaro
The first, Entity Instance Confusion, occurs when the model confuses multiple instances of the same type of entity (e. g. multiple cups).
5 code implementations • CVPR 2019 • Yi Zhu, Karan Sapra, Fitsum A. Reda, Kevin J. Shih, Shawn Newsam, Andrew Tao, Bryan Catanzaro
In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks.
Ranked #2 on Semantic Segmentation on KITTI Semantic Segmentation (using extra training data)
4 code implementations • 28 Nov 2018 • Guilin Liu, Kevin J. Shih, Ting-Chun Wang, Fitsum A. Reda, Karan Sapra, Zhiding Yu, Andrew Tao, Bryan Catanzaro
In this paper, we present a simple yet effective padding scheme that can be used as a drop-in module for existing convolutional neural networks.
no code implementations • 21 Nov 2018 • Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal
We propose an efficient and interpretable scene graph generator.
2 code implementations • 2 Nov 2018 • Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro
We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows.
no code implementations • 1 Nov 2018 • Ji Zhang, Kevin Shih, Andrew Tao, Bryan Catanzaro, Ahmed Elgammal
This article describes the model we built that achieved 1st place in the OpenImage Visual Relationship Detection Challenge on Kaggle.
1 code implementation • ECCV 2018 • Fitsum A. Reda, Guilin Liu, Kevin J. Shih, Robert Kirby, Jon Barker, David Tarjan, Andrew Tao, Bryan Catanzaro
We present an approach for high-resolution video frame prediction by conditioning on both past frames and past optical flows.
Ranked #1 on Video Prediction on YouTube-8M
11 code implementations • NeurIPS 2018 • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Guilin Liu, Andrew Tao, Jan Kautz, Bryan Catanzaro
We study the problem of video-to-video synthesis, whose goal is to learn a mapping function from an input source video (e. g., a sequence of semantic segmentation masks) to an output photorealistic video that precisely depicts the content of the source video.
60 code implementations • ECCV 2018 • Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, Bryan Catanzaro
Existing deep learning based image inpainting methods use a standard convolutional network over the corrupted image, using convolutional filter responses conditioned on both valid pixels as well as the substitute values in the masked holes (typically the mean value).
20 code implementations • CVPR 2018 • Ting-Chun Wang, Ming-Yu Liu, Jun-Yan Zhu, Andrew Tao, Jan Kautz, Bryan Catanzaro
We present a new method for synthesizing high-resolution photo-realistic images from semantic label maps using conditional generative adversarial networks (conditional GANs).
Ranked #2 on Sketch-to-Image Translation on COCO-Stuff
Conditional Image Generation Fundus to Angiography Generation +5