1 code implementation • 19 Sep 2023 • Shiwen Zhang, Shuai Xiao, Weilin Huang
Text-guided image editing on real or synthetic images, given only the original image itself and the target text prompt as inputs, is a very general and challenging task.
no code implementations • 11 Mar 2022 • Shiwen Zhang
New video classification benchmarks aiming to eliminate static biases are proposed; experiments on these benchmarks show that current clip-based 3D CNNs are outperformed by RNN structures and recent video transformers.
Ranked #2 on Video Object Tracking on CATER
no code implementations • ICLR 2020 • Shiwen Zhang, Sheng Guo, Weilin Huang, Matthew R. Scott, Li-Min Wang
Most existing 3D CNN structures for video representation learning are clip-based methods, and do not consider video-level temporal evolution of spatio-temporal features.
no code implementations • 18 Feb 2020 • Shiwen Zhang, Sheng Guo, Li-Min Wang, Weilin Huang, Matthew R. Scott
We design a three-branch architecture consisting of a main branch for action recognition and two auxiliary branches for human parsing and scene recognition, which allow the model to encode knowledge of humans and scenes for action recognition.
1 code implementation • 18 Feb 2020 • Shiwen Zhang, Sheng Guo, Weilin Huang, Matthew R. Scott, Li-Min Wang
Most existing 3D CNNs for video representation learning are clip-based methods, and thus do not consider video-level temporal evolution of spatio-temporal features.