no code implementations • 29 Apr 2024 • Xiang Li, Zhi-Qi Cheng, Jun-Yan He, Xiaojiang Peng, Alexander G. Hauptmann
Emotional Text-to-Speech (E-TTS) synthesis has gained significant attention in recent years due to its potential to enhance human-computer interaction.
no code implementations • 9 Oct 2023 • Lijun Yu, José Lezama, Nitesh B. Gundavarapu, Luca Versari, Kihyuk Sohn, David Minnen, Yong Cheng, Vighnesh Birodkar, Agrim Gupta, Xiuye Gu, Alexander G. Hauptmann, Boqing Gong, Ming-Hsuan Yang, Irfan Essa, David A. Ross, Lu Jiang
While Large Language Models (LLMs) are the dominant models for generative tasks in language, they do not perform as well as diffusion models on image and video generation.
Ranked #2 on Video Prediction on Kinetics-600 12 frames, 64x64
no code implementations • NeurIPS 2023 • Lijun Yu, Yong Cheng, Zhiruo Wang, Vivek Kumar, Wolfgang Macherey, Yanping Huang, David A. Ross, Irfan Essa, Yonatan Bisk, Ming-Hsuan Yang, Kevin Murphy, Alexander G. Hauptmann, Lu Jiang
In this work, we introduce Semantic Pyramid AutoEncoder (SPAE) for enabling frozen LLMs to perform both understanding and generation tasks involving non-linguistic modalities such as images or videos.
no code implementations • 15 Jun 2023 • Lijun Yu, Jin Miao, Xiaoyu Sun, Jiayi Chen, Alexander G. Hauptmann, Hanjun Dai, Wei Wei
Document understanding tasks, in particular, Visually-rich Document Entity Retrieval (VDER), have gained significant attention in recent years thanks to their broad applications in enterprise AI.
1 code implementation • ICCV 2023 • Zhi-Qi Cheng, Qi Dai, SiYao Li, Jingdong Sun, Teruko Mitamura, Alexander G. Hauptmann
We evaluate ChartReader on Chart-to-Table, ChartQA, and Chart-to-Text tasks, demonstrating its superiority over existing methods.
no code implementations • ICCV 2023 • Yijun Qian, Jack Urbanek, Alexander G. Hauptmann, Jungdam Won
Given its wide range of applications, generating 3D human motions from textual descriptions has attracted increasing attention.
1 code implementation • CVPR 2023 • Lijun Yu, Yong Cheng, Kihyuk Sohn, José Lezama, Han Zhang, Huiwen Chang, Alexander G. Hauptmann, Ming-Hsuan Yang, Yuan Hao, Irfan Essa, Lu Jiang
We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model.
Ranked #1 on Video Prediction on Something-Something V2
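The masked generative training behind MAGVIT can be illustrated with a small sketch: discrete video tokens (codebook ids) are randomly replaced by a special mask id, and the model learns to predict the masked positions. This is an illustrative toy, not MAGVIT's code — the `MASK` id, grid shape, and mask ratio here are arbitrary assumptions, and the learned transformer and iterative decoding schedule are omitted.

```python
import numpy as np

MASK = -1  # hypothetical mask-token id, outside the codebook range


def mask_video_tokens(tokens, mask_ratio, rng):
    """Replace a random fraction of discrete video tokens with MASK,
    as in masked generative training (illustrative sketch only)."""
    flat = tokens.flatten().copy()
    n_mask = int(round(mask_ratio * flat.size))
    idx = rng.choice(flat.size, size=n_mask, replace=False)
    flat[idx] = MASK
    return flat.reshape(tokens.shape)


rng = np.random.default_rng(0)
# (frames, height, width) grid of codebook ids in [0, 1024)
tokens = rng.integers(0, 1024, size=(4, 8, 8))
masked = mask_video_tokens(tokens, 0.5, rng)
```

At training time a transformer would be asked to reconstruct the original ids at the masked positions; at sampling time, decoding starts from an all-masked grid and fills tokens in over a few refinement steps.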
1 code implementation • 18 Aug 2022 • Zhi-Qi Cheng, Qi Dai, SiYao Li, Teruko Mitamura, Alexander G. Hauptmann
In the second stage, we exploit transformer layers to unearth the potential semantic relations within both verbs and semantic roles.
1 code implementation • CVPR 2022 • Zhi-Qi Cheng, Qi Dai, Hong Li, Jingkuan Song, Xiao Wu, Alexander G. Hauptmann
We evaluate our methods on 4 mainstream object counting networks (i.e., MCNN, CSRNet, SANet, and ResNet-50).
Ranked #1 on Object Counting on TRANCOS
no code implementations • 14 Jan 2022 • Lijun Yu, Yijun Qian, Wenhe Liu, Alexander G. Hauptmann
Activity detection is an appealing computer vision task that exploits the video streams captured by widely deployed cameras.
no code implementations • CVPR 2022 • Salvador Medina, Denis Tome, Carsten Stoll, Mark Tiede, Kevin Munhall, Alexander G. Hauptmann, Iain Matthews
In this work, we introduce a large-scale speech and mocap dataset that focuses on capturing tongue, jaw, and lip motion.
no code implementations • 2 May 2021 • Ting-yao Hu, Zhi-Qi Cheng, Alexander G. Hauptmann
In this paper, we propose a subspace representation learning (SRL) framework to tackle few-shot image classification tasks.
no code implementations • 19 Feb 2021 • Ting-yao Hu, Alexander G. Hauptmann
In this paper, we propose a novel approach to solve the pose guided person image generation task.
1 code implementation • NeurIPS 2020 • Guoliang Kang, Yunchao Wei, Yi Yang, Yueting Zhuang, Alexander G. Hauptmann
The conventional solution to this task is to minimize the discrepancy between source and target to enable effective knowledge transfer.
Ranked #25 on Synthetic-to-Real Translation on SYNTHIA-to-Cityscapes
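The "conventional solution" the entry above refers to — minimizing the discrepancy between source and target feature distributions — is often instantiated with Maximum Mean Discrepancy (MMD). The linear-kernel version below is a generic illustration of that baseline idea, not the method proposed in the paper, which argues for going beyond plain discrepancy minimization.

```python
import numpy as np


def mmd_linear(source, target):
    """Linear-kernel Maximum Mean Discrepancy between two feature
    batches of shape (n, d): squared distance between batch means.
    A standard UDA discrepancy measure, shown for illustration only."""
    delta = source.mean(axis=0) - target.mean(axis=0)
    return float(delta @ delta)
```

Minimizing such a term during training pulls the two feature distributions together, which enables knowledge transfer but, as the entry notes, is not always sufficient on its own.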
no code implementations • 1 Feb 2020 • Lijun Yu, Peng Chen, Wenhe Liu, Guoliang Kang, Alexander G. Hauptmann
To deal with the aforementioned problems, in this paper, we propose a training-free monocular 3D event detection system for traffic surveillance.
2 code implementations • CVPR 2019 • Guoliang Kang, Lu Jiang, Yi Yang, Alexander G. Hauptmann
Unsupervised Domain Adaptation (UDA) makes predictions for the target domain data while manual annotations are only available in the source domain.
Ranked #7 on Domain Adaptation on Office-31
no code implementations • ECCV 2018 • Xiaojun Chang, Po-Yao Huang, Yi-Dong Shen, Xiaodan Liang, Yi Yang, Alexander G. Hauptmann
In this paper, we address this problem by training relational context-aware agents which learn the actions to localize the target person from the gallery of whole scene images.
no code implementations • 3 Aug 2018 • Ting-yao Hu, Xiaojun Chang, Alexander G. Hauptmann
In this work, we propose the idea of visual distributional representation, which interprets an image set as samples drawn from an unknown distribution in appearance feature space.
1 code implementation • 24 Apr 2018 • Ankit Shah, Anurag Kumar, Alexander G. Hauptmann, Bhiksha Raj
In this work, we first describe a CNN based approach for weakly supervised training of audio events.
1 code implementation • CVPR 2018 • Jiang Liu, Chenqiang Gao, Deyu Meng, Alexander G. Hauptmann
DecideNet starts with estimating the crowd density by generating detection and regression based density maps separately.
Ranked #10 on Crowd Counting on WorldExpo’10
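The two density maps that DecideNet estimates separately are ultimately combined pixel-wise under an attention map. The sketch below shows only that fusion step, under the assumption of an attention map with values in [0, 1]; in the actual model the attention is predicted by a learned sub-network, which is not shown here.

```python
import numpy as np


def fuse_density_maps(det_map, reg_map, attention):
    """Pixel-wise fusion of detection-based and regression-based crowd
    density maps, weighted by an attention map in [0, 1] (illustrative
    sketch of the fusion idea; the attention itself is learned)."""
    return attention * det_map + (1.0 - attention) * reg_map
```

The final crowd count is then the sum over the fused density map, so the attention effectively decides, per pixel, whether to trust detection (reliable in sparse regions) or regression (reliable in congested regions).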
no code implementations • ICCV 2017 • Hehe Fan, Xiaojun Chang, De Cheng, Yi Yang, Dong Xu, Alexander G. Hauptmann
We formulate this task as a multi-instance learning (MIL) problem by taking each video as a bag and the video shots in each video as instances.
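In a bag/instance formulation like the one above, a video (bag) is positive for an event class if at least one of its shots (instances) is relevant, so a common scoring rule is the max over instance scores. This minimal sketch shows that rule only; the instance scorer and MIL training objective from the paper are not reproduced here.

```python
import numpy as np


def bag_score(instance_scores):
    """MIL bag scoring: a video is as relevant as its most relevant
    shot, so the bag score is the maximum over instance scores."""
    return float(np.max(instance_scores))
```

Under this rule a video with one highly relevant shot scores high even if most shots are background, matching the weakly labeled setting where only video-level event labels are available.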
no code implementations • 25 Jul 2017 • De Cheng, Yihong Gong, Zhihui Li, Weiwei Shi, Alexander G. Hauptmann, Nanning Zheng
The proposed method can take full advantage of the structured distance relationships among the training samples via the constructed complete graph.
no code implementations • 5 Jul 2017 • Po-Yao Huang, Ye Yuan, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann
We report on CMU Informedia Lab's system used in Google's YouTube-8M Video Understanding Challenge.
3 code implementations • 2 Apr 2017 • Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexander G. Hauptmann
State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs.
Ranked #20 on Action Recognition on UCF101
no code implementations • 8 Feb 2017 • Yi Zhu, Zhenzhong Lan, Shawn Newsam, Alexander G. Hauptmann
We study the unsupervised learning of CNNs for optical flow estimation using proxy ground truth data.
no code implementations • 4 Feb 2017 • Minnan Luo, Xiaojun Chang, Zhihui Li, Liqiang Nie, Alexander G. Hauptmann, Qinghua Zheng
The heterogeneity-gap between different modalities brings a significant challenge to multimedia information retrieval.
no code implementations • 25 Jan 2017 • Zhenzhong Lan, Yi Zhu, Alexander G. Hauptmann
We investigate the problem of representing an entire video using CNN features for human action recognition.
no code implementations • 10 Oct 2016 • Liang Zheng, Yi Yang, Alexander G. Hauptmann
Person re-identification (re-ID) has become increasingly popular in the community due to its practical applications and research significance.
Ranked #83 on Person Re-Identification on DukeMTMC-reID
no code implementations • 12 Aug 2016 • Mengyi Liu, Lu Jiang, Shiguang Shan, Alexander G. Hauptmann
Multimedia event detection has been receiving increasing attention in recent years.
no code implementations • 17 Jun 2016 • Shoou-I Yu, Yi Yang, Zhongwen Xu, Shicheng Xu, Deyu Meng, Zexi Mao, Zhigang Ma, Ming Lin, Xuanchong Li, Huan Li, Zhenzhong Lan, Lu Jiang, Alexander G. Hauptmann, Chuang Gan, Xingzhong Du, Xiaojun Chang
The large number of user-generated videos uploaded to the Internet every day has led to many commercial video search engines, which mainly rely on text metadata for search.
no code implementations • 25 Apr 2016 • Shoou-I Yu, Yi Yang, Xuanchong Li, Alexander G. Hauptmann
Therefore, our tracker propagates identity information to frames without recognized faces by uncovering the appearance and spatial manifold formed by person detections.
no code implementations • 14 Jan 2016 • Xiaojun Chang, Yi Yang, Guodong Long, Chengqi Zhang, Alexander G. Hauptmann
In this paper, we focus on automatically detecting events in unconstrained videos without the use of any visual training exemplars.
no code implementations • 11 Dec 2015 • Zhenzhong Lan, Shoou-I Yu, Alexander G. Hauptmann
We propose two well-motivated ranking-based methods to enhance the performance of current state-of-the-art human activity recognition systems.
no code implementations • 16 Nov 2015 • Zhenzhong Lan, Shoou-I Yu, Ming Lin, Bhiksha Raj, Alexander G. Hauptmann
We approach this problem by first showing that local handcrafted features and Convolutional Neural Networks (CNNs) share the same convolution-pooling network structure.
no code implementations • 15 Nov 2015 • Linchao Zhu, Zhongwen Xu, Yi Yang, Alexander G. Hauptmann
In this work, we introduce Video Question Answering in the temporal domain to infer the past, describe the present and predict the future.
no code implementations • 15 Oct 2015 • Zhenzhong Lan, Alexander G. Hauptmann
We address the problem of generating video features for action recognition.
no code implementations • 13 Feb 2015 • Zhenzhong Lan, Xuanchong Li, Ming Lin, Alexander G. Hauptmann
Therefore, they need to occur frequently enough in the videos and to be able to distinguish among different types of motions.
no code implementations • CVPR 2015 • Zhenzhong Lan, Ming Lin, Xuanchong Li, Alexander G. Hauptmann, Bhiksha Raj
MIFS compensates for information lost from using differential operators by recapturing information at coarse scales.
no code implementations • CVPR 2015 • Zhongwen Xu, Yi Yang, Alexander G. Hauptmann
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available.
no code implementations • CVPR 2014 • Zhongwen Xu, Ivor W. Tsang, Yi Yang, Zhigang Ma, Alexander G. Hauptmann
We address the challenging problem of utilizing related exemplars for complex event detection while multiple features are available.
no code implementations • CVPR 2013 • Zhigang Ma, Yi Yang, Zhongwen Xu, Shuicheng Yan, Nicu Sebe, Alexander G. Hauptmann
Compared to complex event videos, these external videos contain simple contents such as objects, scenes and actions which are the basic elements of complex events.