no code implementations • 15 Jan 2025 • Yuan Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, Xiang Wang
These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images.
no code implementations • 20 Dec 2024 • Jiaxin Wu, Chong-Wah Ngo, Xiao-Yong Wei, Qing Li
The generated queries retrieve rank lists different from that of the original query.
no code implementations • 19 Nov 2024 • Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Parameter-efficient fine-tuning multimodal large language models (MLLMs) presents significant challenges, including reliance on high-level visual features that limit fine-grained detail comprehension, and data conflicts that arise from task complexity.
no code implementations • 13 Nov 2024 • Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Existing works for recipe generation primarily utilize a two-stage training method, first generating ingredients and then obtaining instructions from both the image and ingredients.
1 code implementation • 16 Oct 2024 • Genta Indra Winata, Frederikus Hudi, Patrick Amadeus Irawan, David Anugraha, Rifki Afina Putri, Yutong Wang, Adam Nohejl, Ubaidillah Ariq Prathama, Nedjma Ousidhoum, Afifa Amriani, Anar Rzayev, Anirban Das, Ashmari Pramodya, Aulia Adila, Bryan Wilie, Candy Olivia Mawalim, Ching Lam Cheng, Daud Abolade, Emmanuele Chersoni, Enrico Santus, Fariz Ikhwantri, Garry Kuwanto, Hanyang Zhao, Haryo Akbarianto Wibowo, Holy Lovenia, Jan Christian Blaise Cruz, Jan Wira Gotama Putra, Junho Myung, Lucky Susanto, Maria Angelica Riera Machin, Marina Zhukova, Michael Anugraha, Muhammad Farid Adilazuarda, Natasha Santosa, Peerat Limkonchotiwat, Raj Dabre, Rio Alexander Audino, Samuel Cahyawijaya, Shi-Xiong Zhang, Stephanie Yulia Salim, Yi Zhou, Yinxuan Gui, David Ifeoluwa Adelani, En-Shiun Annie Lee, Shogo Okada, Ayu Purwarianti, Alham Fikri Aji, Taro Watanabe, Derry Tanti Wijaya, Alice Oh, Chong-Wah Ngo
This benchmark includes a visual question answering (VQA) dataset with text-image pairs across 30 languages and dialects, spanning 9 language families and featuring over 1 million data points, making it the largest multicultural VQA benchmark to date.
1 code implementation • 11 Sep 2024 • Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei
Despite tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with detailed high-resolution textures, especially in the paradigm of 2D diffusion that lacks 3D awareness.
no code implementations • 17 Jul 2024 • Pengkun Jiao, Xinlan Wu, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Uni-Food is designed to provide a more holistic approach to food data analysis, thereby enhancing the performance and capabilities of LMMs in this domain.
1 code implementation • 3 Jul 2024 • Yanbin Hao, Diansong Zhou, Zhicai Wang, Chong-Wah Ngo, Meng Wang
In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks.
no code implementations • 9 Apr 2024 • Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan
Experimental results show that integrating the proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by margins ranging from 2% to 77%, with an average of about 20%.
no code implementations • CVPR 2024 • Xiongwei Wu, Sicheng Yu, Ee-Peng Lim, Chong-Wah Ngo
The pre-training phase equips FoodLearner with the capability to align visual information with corresponding textual representations that are specifically related to food, while the second phase adapts both the FoodLearner and the Image-Informed Text Encoder for the segmentation task.
1 code implementation • 19 Feb 2024 • Jiaxin Wu, Chong-Wah Ngo
Answering queries with semantic concepts has long been the mainstream approach for video search.
no code implementations • 22 Dec 2023 • Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, Chong-Wah Ngo
In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain.
no code implementations • 28 Jun 2023 • Huu-Thanh Nguyen, Yu Cao, Chong-Wah Ngo, Wing-Kwong Chan
The power of the framework is a novel difficulty assessment model, which forecasts how challenging an unlabelled sample is to the latest trained instance segmentation model.
1 code implementation • 27 Jun 2023 • Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou
Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data.
no code implementations • 19 Feb 2023 • Zhixin Ma, Chong-Wah Ngo
Nevertheless, when the first few pages of results are swamped with visually similar items, or the search target is hidden deep in the ranked list, finding the known-item target usually requires a long duration of browsing and result inspection.
no code implementations • ICCV 2023 • Qi Cai, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei
Recent progress on multi-modal 3D object detection has featured BEV (Bird-Eye-View) based fusion, which effectively unifies both LiDAR point clouds and camera images in a shared BEV space.
no code implementations • 16 Nov 2022 • Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
This technical report describes the CONE approach for Ego4D Natural Language Queries (NLQ) Challenge in ECCV 2022.
1 code implementation • 15 Nov 2022 • Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei
The pre-determined kernel size severely limits the temporal receptive fields, and the fixed weights treat each spatial location across frames equally, resulting in a sub-optimal solution for long-range temporal modeling in natural scenes.
1 code implementation • 22 Sep 2022 • Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan
This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query.
1 code implementation • 12 Jul 2022 • Hao Zhang, Lechao Cheng, Yanbin Hao, Chong-Wah Ngo
By replacing a vanilla 2D attention with the LAPS, we can adapt a static transformer into a video one, with zero extra parameters and negligible computation overhead (~2.6%).
2 code implementations • 11 Jul 2022 • Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei
Motivated by wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates invertible down-sampling with wavelet transforms and self-attention learning in a unified way.
Ranked #226 on Image Classification on ImageNet
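The invertibility of wavelet down-sampling mentioned above can be seen in miniature with a one-level 1D Haar transform, which halves the resolution yet loses no information. This is an illustrative sketch only, not the Wave-ViT implementation (which operates on 2D feature maps inside a transformer):

```python
from math import sqrt

def haar_dwt(x):
    """One level of the 1D Haar wavelet transform.

    Splits an even-length signal into low-pass (approximation) and
    high-pass (detail) halves, each at half the resolution -- an
    invertible 2x down-sampling.
    """
    even, odd = x[0::2], x[1::2]
    low = [(a + b) / sqrt(2) for a, b in zip(even, odd)]
    high = [(a - b) / sqrt(2) for a, b in zip(even, odd)]
    return low, high

def haar_idwt(low, high):
    """Inverse transform: perfectly reconstructs the original signal."""
    x = []
    for l, h in zip(low, high):
        x.append((l + h) / sqrt(2))  # even sample
        x.append((l - h) / sqrt(2))  # odd sample
    return x

signal = [4.0, 2.0, 5.0, 7.0]
low, high = haar_dwt(signal)
reconstructed = haar_idwt(low, high)
assert all(abs(a - b) < 1e-9 for a, b in zip(reconstructed, signal))
```

Because the round trip is lossless, the detail coefficients can be carried alongside the down-sampled features rather than discarded, which is the property the paper exploits.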
1 code implementation • 1 Jul 2022 • Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, Zhijian Hou
Cross-modal representation learning has become a new normal for bridging the semantic gap between text and visual data.
no code implementations • CVPR 2022 • Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei
By deriving the novel grouped time mixing (GTM) operations, we equip the basic token-mixing MLP with the ability of temporal modeling.
Ranked #22 on Action Recognition on Something-Something V1
no code implementations • 8 May 2022 • Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Wing-Kwong Chan
To bridge the domain gap, a recipe mixup loss is proposed to enforce that the intermediate domain lies on the shortest geodesic path between the source and target domains in the recipe embedding space.
1 code implementation • 26 Apr 2022 • Zixuan Su, Hao Zhang, Jingjing Chen, Lei Pang, Chong-Wah Ngo, Yu-Gang Jiang
Neural networks for visual content understanding have recently evolved from convolutional ones (CNNs) to transformers.
1 code implementation • CVPR 2022 • Yanbin Hao, Hao Zhang, Chong-Wah Ngo, Xiangnan He
By utilizing calibrators to embed feature with four different kinds of contexts in parallel, the learnt representation is expected to be more resilient to diverse types of activities.
Ranked #3 on Egocentric Activity Recognition on EGTEA
no code implementations • ICCV 2021 • Zhaofan Qiu, Ting Yao, Yan Shu, Chong-Wah Ngo, Tao Mei
This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits off-the-shelf image recognition system on the synthetic frame.
no code implementations • CVPR 2021 • Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xiao-Ping Zhang, Dong Wu, Tao Mei
Video content is multifaceted, consisting of objects, scenes, interactions or actions.
1 code implementation • 11 Jan 2022 • Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei
In this paper, we decompose the path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state.
1 code implementation • 21 Sep 2021 • Zhijian Hou, Chong-Wah Ngo, Wing Kwong Chan
This task is essential because advanced video retrieval applications should enable users to retrieve a precise moment from a large video corpus.
Ranked #1 on Video Corpus Moment Retrieval on TVR
3 code implementations • 5 Aug 2021 • Hao Zhang, Yanbin Hao, Chong-Wah Ngo
It is worth noting that our TokShift transformer is a pioneering pure, convolution-free video transformer with computational efficiency for video understanding.
no code implementations • 21 May 2021 • Qiyuan Liang, Bin Zhu, Chong-Wah Ngo
In this paper, we propose the pyramid fusion dark channel prior (PF-DCP) for single image dehazing.
no code implementations • 20 Aug 2020 • Liangming Pan, Jingjing Chen, Jianlong Wu, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, Yu-Gang Jiang, Tat-Seng Chua
Understanding a food recipe requires anticipating the implicit causal effects of cooking actions, so that the recipe can be converted into a graph describing its temporal workflow.
no code implementations • CVPR 2020 • Yiheng Zhang, Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Dong Liu, Tao Mei
Given the extreme expense of expert labeling, recent research has shown that models trained on photo-realistic synthetic data (e.g., computer games) with computer-generated annotations can be adapted to real images.
Ranked #19 on Domain Adaptation on SYNTHIA-to-Cityscapes
no code implementations • CVPR 2020 • Yingwei Pan, Ting Yao, Yehao Li, Chong-Wah Ngo, Tao Mei
A clustering branch is capitalized on to ensure that the learnt representation preserves such underlying structure by matching the estimated assignment distribution over clusters to the inherent cluster distribution for each target sample.
1 code implementation • 19 May 2020 • Wan-Lei Zhao, Run-Qing Chen, Hui Ye, Chong-Wah Ngo
This optimization procedure converges faster to a better local minimum than k-means and many of its variants.
no code implementations • 1 Feb 2020 • Hui-Chu Xiao, Wan-Lei Zhao, Jie Lin, Chong-Wah Ngo
Due to the lack of proper mechanism in locating instances and deriving feature representation, instance search is generally only effective for retrieving instances of known object categories.
1 code implementation • 2 Aug 2019 • Wan-Lei Zhao, Hui Wang, Peng-Cheng Lin, Chong-Wah Ngo
Unfortunately, a closely related issue of how to merge two existing k-NN graphs has been overlooked.
no code implementations • 20 Jun 2019 • Fuchen Long, Qi Cai, Zhaofan Qiu, Zhijian Hou, Yingwei Pan, Ting Yao, Chong-Wah Ngo
This notebook paper presents an overview and comparative analysis of our system designed for activity detection in extended videos (ActEV-PC) in ActivityNet Challenge 2019.
no code implementations • CVPR 2019 • Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, Tao Mei
Diffusions effectively interact two aspects of information, i.e., localized and holistic, for a more powerful form of representation learning.
Ranked #10 on Action Recognition on UCF101
no code implementations • CVPR 2019 • Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, Tao Mei
Specifically, we present Transferrable Prototypical Networks (TPN) for adaptation such that the prototypes for each class in source and target domains are close in the embedding space and the score distributions predicted by prototypes separately on source and target data are similar.
1 code implementation • CVPR 2019 • Qi Cai, Yingwei Pan, Chong-Wah Ngo, Xinmei Tian, Ling-Yu Duan, Ting Yao
The whole architecture is then optimized with three consistency regularizations: 1) region-level consistency to align the region-level predictions between teacher and student, 2) inter-graph consistency for matching the graph structures between teacher and student, and 3) intra-graph consistency to enhance the similarity between regions of same class within the graph of student.
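The three regularizations above can be pictured as one weighted objective. The sketch below is a loose illustration under stated assumptions: all names are hypothetical, squared error stands in for whatever distances the paper actually uses, and graphs are given as flattened adjacency rows:

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def consistency_loss(stu_preds, tea_preds, stu_graph, tea_graph,
                     stu_feats, labels, weights=(1.0, 1.0, 1.0)):
    """Hypothetical combination of the three consistency terms:
    (1) region-level: match per-region predictions of teacher and student;
    (2) inter-graph: match the two relation (adjacency) matrices;
    (3) intra-graph: pull student features of same-class regions together.
    """
    w1, w2, w3 = weights
    n = len(stu_preds)
    region = sum(sq_dist(p, q) for p, q in zip(stu_preds, tea_preds)) / n
    graph = sum(sq_dist(r, s)
                for r, s in zip(stu_graph, tea_graph)) / len(stu_graph)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)
             if labels[i] == labels[j]]
    intra = sum(sq_dist(stu_feats[i], stu_feats[j])
                for i, j in pairs) / max(len(pairs), 1)
    return w1 * region + w2 * graph + w3 * intra
```

When teacher and student agree and same-class regions share features, all three terms vanish, so the loss only penalizes the mismatches each regularization targets.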
no code implementations • 14 Apr 2018 • Zhi-Qi Cheng, Hao Zhang, Xiao Wu, Chong-Wah Ngo
A principled way of hyperlinking can be carried out by picking centers of clusters as anchors and from there reaching out to targets within or outside of clusters with consideration of neighborhood complexity.
no code implementations • 9 Apr 2018 • Wan-Lei Zhao, Hui Wang, Chong-Wah Ngo
On the one hand, the approximate k-nearest neighbor graph construction is treated as a search task.
1 code implementation • 27 Feb 2017 • Joachim D. Curtó, Irene C. Zarza, Feng Yang, Alexander J. Smola, Fernando de la Torre, Chong-Wah Ngo, Luc van Gool
The algorithm requires computing the product of Walsh-Hadamard Transform (WHT) matrices.
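Such Hadamard-matrix products need not be formed explicitly: the fast Walsh-Hadamard transform applies a 2^k x 2^k Hadamard matrix to a vector in O(n log n) via butterfly passes. A minimal sketch (the function name and the unnormalized sign convention are assumptions, not the paper's code):

```python
def fwht(a):
    """Fast Walsh-Hadamard transform of a length-2^k sequence
    (unnormalized): equivalent to multiplying by the Hadamard matrix
    in O(n log n) instead of the O(n^2) dense matrix product.
    """
    a = list(a)  # work on a copy
    n = len(a)
    h = 1
    while h < n:
        # One butterfly pass: combine pairs h apart.
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a

# The Hadamard matrix is its own inverse up to a factor of n,
# so applying fwht twice returns n times the input.
assert fwht([1, 0, 1, 0]) == [2, 2, 0, 0]
```

Since the transform is just additions and subtractions, the product of several WHT matrices reduces to chained fwht calls with no matrix materialized.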
no code implementations • 8 Oct 2016 • Wan-Lei Zhao, Cheng-Hao Deng, Chong-Wah Ngo
The performance of k-means has been enhanced from different perspectives over the years.
no code implementations • ICCV 2015 • Ting Yao, Tao Mei, Chong-Wah Ngo
One of the fundamental problems in image search is to learn the ranking functions, i.e., similarity between the query and image.
no code implementations • CVPR 2015 • Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, Tao Mei
In many real-world applications, we often face the problem of cross-domain learning, i.e., borrowing the labeled data or transferring already-learnt knowledge from a source domain to a target domain.