1 code implementation • Knowledge-Based Systems 2023 • Steve Andreas Immanuel, Cheol Jeong
Due to the high computational cost of the self-attention and the high dimensional data of video, they either have to settle for: 1) only training the cross-modal encoder on offline-extracted video and text features or 2) training the cross-modal encoder with the video and text feature extractor, but only using sparsely-sampled video frames.
Ranked #7 on
TGIF-Transition
on TGIF-QA
1 code implementation • Journal of King Saud University - Computer and Information Sciences 2023 • Willy Fitra Hendria, Vania Velda, Bahy Helmi Hartoyo Putra, Fikriansyah Adzaka, Cheol Jeong
However, directly using the action features without object-specific representation may not well capture the object interactions.
Ranked #4 on
Video Captioning
on MSVD-CTN
no code implementations • ICT Express 2021 • Willy Fitra Hendria, Quang Thinh Phan, Fikriansyah Adzaka, Cheol Jeong
Combining multiple models is a well-known technique to improve predictive performance in challenging tasks such as object detection in UAV imagery.