no code implementations • ICCV 2023 • Sarah Ibrahimi, Xiaohang Sun, Pichao Wang, Amanmeet Garg, Ashutosh Sanan, Mohamed Omar
Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment.
Ranked #10 on Video Retrieval on MSR-VTT
no code implementations • Submitted to ICLR 2022 • Wentao Zhu, Jingru Yi, Kevin Hsu, Xiaohang Sun, Xiang Hao, Linda Liu, Mohamed Omar
AVT uses a combination of video and audio signals to improve action recognition accuracy, leveraging the effective spatio-temporal representation learned by the video Transformer.
Ranked #4 on Multi-modal Classification on VGG-Sound
no code implementations • Submitted to ICLR 2022 • Wentao Zhu, Jingru Yi, Xiaohang Sun, Xiang Hao, Linda Liu, Mohamed Omar
In this work, we develop a multiscale multimodal Transformer (MMT) that employs hierarchical representation learning.
Ranked #1 on Multi-modal Classification on VGG-Sound
no code implementations • ICML 2020 • Yaotian Wang, Xiaohang Sun, Jason W. Fleischer
Recovering a signal from its Fourier intensity underlies many important applications, including lensless imaging and imaging through scattering media.
no code implementations • CVPR 2019 • Wei-An Lin, Haofu Liao, Cheng Peng, Xiaohang Sun, Jingdan Zhang, Jiebo Luo, Rama Chellappa, Shaohua Kevin Zhou
The linkage between the sinogram and image domains is a novel Radon inversion layer that allows the gradients to back-propagate from the image domain to the sinogram domain during training.