Search Results for author: Chong-Wah Ngo

Found 48 papers, 19 papers with code

CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion

no code implementations 15 Jan 2025 YuAn Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, Xiang Wang

These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images.

Text-to-Image Generation

Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

no code implementations 19 Nov 2024 Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

Parameter-efficient fine-tuning multimodal large language models (MLLMs) presents significant challenges, including reliance on high-level visual features that limit fine-grained detail comprehension, and data conflicts that arise from task complexity.

parameter-efficient fine-tuning

Retrieval Augmented Recipe Generation

no code implementations 13 Nov 2024 Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

Existing works for recipe generation primarily utilize a two-stage training method, first generating ingredients and then obtaining instructions from both the image and ingredients.

Recipe Generation Retrieval

Hi3D: Pursuing High-Resolution Image-to-3D Generation with Video Diffusion Models

1 code implementation 11 Sep 2024 Haibo Yang, Yang Chen, Yingwei Pan, Ting Yao, Zhineng Chen, Chong-Wah Ngo, Tao Mei

Despite having tremendous progress in image-to-3D generation, existing methods still struggle to produce multi-view consistent images with high-resolution textures in detail, especially in the paradigm of 2D diffusion that lacks 3D awareness.

3D Generation 3D Reconstruction +3

RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models

no code implementations 17 Jul 2024 Pengkun Jiao, Xinlan Wu, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yugang Jiang

Uni-Food is designed to provide a more holistic approach to food data analysis, thereby enhancing the performance and capabilities of LMMs in this domain.

PosMLP-Video: Spatial and Temporal Relative Position Encoding for Efficient Video Recognition

1 code implementation 3 Jul 2024 Yanbin Hao, Diansong Zhou, Zhicai Wang, Chong-Wah Ngo, Meng Wang

In recent years, vision Transformers and MLPs have demonstrated remarkable performance in image understanding tasks.

Position Video Recognition

Improving Interpretable Embeddings for Ad-hoc Video Search with Generative Captions and Multi-word Concept Bank

no code implementations 9 Apr 2024 Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan

Experimental results show that the integration of the above-proposed elements doubles the R@1 performance of the AVS method on the MSRVTT dataset and improves the xinfAP on the TRECVid AVS query sets for 2016-2023 (eight years) by a margin from 2% to 77%, with an average about 20%.

Ad-hoc video search

OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation

no code implementations CVPR 2024 Xiongwei Wu, Sicheng Yu, Ee-Peng Lim, Chong-Wah Ngo

The pre-training phase equips FoodLearner with the capability to align visual information with corresponding textual representations that are specifically related to food, while the second phase adapts both the FoodLearner and the Image-Informed Text Encoder for the segmentation task.

Image Segmentation Image to text +2

Interpretable Embedding for Ad-hoc Video Search

1 code implementation 19 Feb 2024 Jiaxin Wu, Chong-Wah Ngo

Answering query with semantic concepts has long been the mainstream approach for video search.

Ad-hoc video search

FoodLMM: A Versatile Food Assistant using Large Multi-modal Model

no code implementations 22 Dec 2023 Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, Chong-Wah Ngo

In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain.

Food Recognition Multi-Task Learning +4

Incremental Learning on Food Instance Segmentation

no code implementations 28 Jun 2023 Huu-Thanh Nguyen, Yu Cao, Chong-Wah Ngo, Wing-Kwong Chan

The power of the framework is a novel difficulty assessment model, which forecasts how challenging an unlabelled sample is to the latest trained instance segmentation model.

Incremental Learning Instance Segmentation +2

GroundNLQ @ Ego4D Natural Language Queries Challenge 2023

1 code implementation 27 Jun 2023 Zhijian Hou, Lei Ji, Difei Gao, Wanjun Zhong, Kun Yan, Chao Li, Wing-Kwong Chan, Chong-Wah Ngo, Nan Duan, Mike Zheng Shou

Motivated by this, we leverage a two-stage pre-training strategy to train egocentric feature extractors and the grounding model on video narrations, and further fine-tune the model on annotated data.

Natural Language Queries

Interactive Video Corpus Moment Retrieval using Reinforcement Learning

no code implementations 19 Feb 2023 Zhixin Ma, Chong-Wah Ngo

Nevertheless, when the first few pages of results are swamped with visually similar items, or the search target is hidden deep in the ranked list, finding the known-item target usually requires a long duration of browsing and result inspection.

Moment Retrieval reinforcement-learning +4

ObjectFusion: Multi-modal 3D Object Detection with Object-Centric Fusion

no code implementations ICCV 2023 Qi Cai, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei

Recent progress on multi-modal 3D object detection has featured BEV (Bird-Eye-View) based fusion, which effectively unifies both LiDAR point clouds and camera images in a shared BEV space.

3D Object Detection Depth Estimation +2

Dynamic Temporal Filtering in Video Models

1 code implementation 15 Nov 2022 Fuchen Long, Zhaofan Qiu, Yingwei Pan, Ting Yao, Chong-Wah Ngo, Tao Mei

The pre-determined kernel size severely limits the temporal receptive fields and the fixed weights treat each spatial location across frames equally, resulting in sub-optimal solution for long-range temporal modeling in natural scenes.

CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding

1 code implementation 22 Sep 2022 Zhijian Hou, Wanjun Zhong, Lei Ji, Difei Gao, Kun Yan, Wing-Kwong Chan, Chong-Wah Ngo, Zheng Shou, Nan Duan

This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query.

Contrastive Learning Video Grounding

Long-term Leap Attention, Short-term Periodic Shift for Video Classification

1 code implementation 12 Jul 2022 Hao Zhang, Lechao Cheng, Yanbin Hao, Chong-Wah Ngo

By replacing a vanilla 2D attention with the LAPS, we could adapt a static transformer into a video one, with zero extra parameters and negligible computation overhead (~2.6%).

Video Classification

Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning

2 code implementations 11 Jul 2022 Ting Yao, Yingwei Pan, Yehao Li, Chong-Wah Ngo, Tao Mei

Motivated by the wavelet theory, we construct a new Wavelet Vision Transformer (Wave-ViT) that formulates the invertible down-sampling with wavelet transforms and self-attention learning in a unified way.

Image Classification Instance Segmentation +4

(Un)likelihood Training for Interpretable Embedding

1 code implementation 1 Jul 2022 Jiaxin Wu, Chong-Wah Ngo, Wing-Kwong Chan, Zhijian Hou

Cross-modal representation learning has become a new normal for bridging the semantic gap between text and visual data.

Ad-hoc video search Decoder +3

MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing

no code implementations CVPR 2022 Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei

By deriving the novel grouped time mixing (GTM) operations, we equip the basic token-mixing MLP with the ability of temporal modeling.

3D Architecture Action Classification +2

Cross-lingual Adaptation for Recipe Retrieval with Mixup

no code implementations 8 May 2022 Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Wing-Kwong Chan

To bridge the domain gap, recipe mixup loss is proposed to enforce the intermediate domain to locate in the shortest geodesic path between source and target domains in the recipe embedding space.

Retrieval Unsupervised Domain Adaptation

Adaptive Split-Fusion Transformer

1 code implementation 26 Apr 2022 Zixuan Su, Hao Zhang, Jingjing Chen, Lei Pang, Chong-Wah Ngo, Yu-Gang Jiang

Neural networks for visual content understanding have recently evolved from convolutional ones (CNNs) to transformers.

Image Classification

Group Contextualization for Video Recognition

1 code implementation CVPR 2022 Yanbin Hao, Hao Zhang, Chong-Wah Ngo, Xiangnan He

By utilizing calibrators to embed feature with four different kinds of contexts in parallel, the learnt representation is expected to be more resilient to diverse types of activities.

Action Recognition Egocentric Activity Recognition +1

Condensing a Sequence to One Informative Frame for Video Recognition

no code implementations ICCV 2021 Zhaofan Qiu, Ting Yao, Yan Shu, Chong-Wah Ngo, Tao Mei

This paper studies a two-step alternative that first condenses the video sequence to an informative "frame" and then exploits off-the-shelf image recognition system on the synthetic frame.

Motion Estimation valid +1

Optimization Planning for 3D ConvNets

1 code implementation 11 Jan 2022 Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Tao Mei

In this paper, we decompose the path into a series of training "states" and specify the hyper-parameters, e.g., learning rate and the length of input clips, in each state.

Video Recognition

CONQUER: Contextual Query-aware Ranking for Video Corpus Moment Retrieval

1 code implementation 21 Sep 2021 Zhijian Hou, Chong-Wah Ngo, Wing Kwong Chan

This task is essential because advanced video retrieval applications should enable users to retrieve a precise moment from a large video corpus.

Corpus Video Moment Retrieval Moment Retrieval +6

Token Shift Transformer for Video Classification

3 code implementations 5 Aug 2021 Hao Zhang, Yanbin Hao, Chong-Wah Ngo

It is worth noticing that our TokShift transformer is a pure convolutional-free video transformer pilot with computational efficiency for video understanding.

Classification Computational Efficiency +2

Pyramid Fusion Dark Channel Prior for Single Image Dehazing

no code implementations 21 May 2021 Qiyuan Liang, Bin Zhu, Chong-Wah Ngo

In this paper, we propose the pyramid fusion dark channel prior (PF-DCP) for single image dehazing.

Image Dehazing Single Image Dehazing

Multi-modal Cooking Workflow Construction for Food Recipes

no code implementations 20 Aug 2020 Liangming Pan, Jingjing Chen, Jianlong Wu, Shaoteng Liu, Chong-Wah Ngo, Min-Yen Kan, Yu-Gang Jiang, Tat-Seng Chua

Understanding food recipe requires anticipating the implicit causal effects of cooking actions, such that the recipe can be converted into a graph describing the temporal workflow of the recipe.

Common Sense Reasoning Decoder

Transferring and Regularizing Prediction for Semantic Segmentation

no code implementations CVPR 2020 Yiheng Zhang, Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Dong Liu, Tao Mei

In the view of extremely expensive expert labeling, recent research has shown that the models trained on photo-realistic synthetic data (e.g., computer games) with computer-generated annotations can be adapted to real images.

Domain Adaptation Segmentation +1

Exploring Category-Agnostic Clusters for Open-Set Domain Adaptation

no code implementations CVPR 2020 Yingwei Pan, Ting Yao, Yehao Li, Chong-Wah Ngo, Tao Mei

A clustering branch is capitalized on to ensure that the learnt representation preserves such underlying structure by matching the estimated assignment distribution over clusters to the inherent cluster distribution for each target sample.

Clustering Unsupervised Domain Adaptation

k-sums: another side of k-means

1 code implementation 19 May 2020 Wan-Lei Zhao, Run-Qing Chen, Hui Ye, Chong-Wah Ngo

This optimization procedure converges faster to a better local minimum over k-means and many of its variants.

Clustering Stochastic Optimization

Deeply Activated Salient Region for Instance Search

no code implementations 1 Feb 2020 Hui-Chu Xiao, Wan-Lei Zhao, Jie Lin, Chong-Wah Ngo

Due to the lack of proper mechanism in locating instances and deriving feature representation, instance search is generally only effective for retrieving instances of known object categories.

Image Retrieval Instance Search

On the Merge of k-NN Graph

1 code implementation 2 Aug 2019 Wan-Lei Zhao, Hui Wang, Peng-Cheng Lin, Chong-Wah Ngo

Unfortunately, a closely related issue of how to merge two existing k-NN graphs has been overlooked.

graph construction Information Retrieval +1

vireoJD-MM at Activity Detection in Extended Videos

no code implementations 20 Jun 2019 Fuchen Long, Qi Cai, Zhaofan Qiu, Zhijian Hou, Yingwei Pan, Ting Yao, Chong-Wah Ngo

This notebook paper presents an overview and comparative analysis of our system designed for activity detection in extended videos (ActEV-PC) in ActivityNet Challenge 2019.

Action Detection Action Localization +1

Learning Spatio-Temporal Representation with Local and Global Diffusion

no code implementations CVPR 2019 Zhaofan Qiu, Ting Yao, Chong-Wah Ngo, Xinmei Tian, Tao Mei

Diffusions effectively interact two aspects of information, i.e., localized and holistic, for more powerful way of representation learning.

Action Classification Action Detection +5

Transferrable Prototypical Networks for Unsupervised Domain Adaptation

no code implementations CVPR 2019 Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, Tao Mei

Specifically, we present Transferrable Prototypical Networks (TPN) for adaptation such that the prototypes for each class in source and target domains are close in the embedding space and the score distributions predicted by prototypes separately on source and target data are similar.

Pseudo Label Unsupervised Domain Adaptation

Exploring Object Relation in Mean Teacher for Cross-Domain Detection

1 code implementation CVPR 2019 Qi Cai, Yingwei Pan, Chong-Wah Ngo, Xinmei Tian, Ling-Yu Duan, Ting Yao

The whole architecture is then optimized with three consistency regularizations: 1) region-level consistency to align the region-level predictions between teacher and student, 2) inter-graph consistency for matching the graph structures between teacher and student, and 3) intra-graph consistency to enhance the similarity between regions of same class within the graph of student.

Relation Unsupervised Domain Adaptation

On the Selection of Anchors and Targets for Video Hyperlinking

no code implementations 14 Apr 2018 Zhi-Qi Cheng, Hao Zhang, Xiao Wu, Chong-Wah Ngo

A principled way of hyperlinking can be carried out by picking centers of clusters as anchors and from there reach out to targets within or outside of clusters with consideration of neighborhood complexity.

Approximate k-NN Graph Construction: a Generic Online Approach

no code implementations 9 Apr 2018 Wan-Lei Zhao, Hui Wang, Chong-Wah Ngo

On the one hand, the approximate k-nearest neighbor graph construction is treated as a search task.

graph construction Information Retrieval +1

Boost K-Means

no code implementations 8 Oct 2016 Wan-Lei Zhao, Cheng-Hao Deng, Chong-Wah Ngo

The performance of k-means has been enhanced from different perspectives over the years.

Clustering Image Clustering +1

Learning Query and Image Similarities With Ranking Canonical Correlation Analysis

no code implementations ICCV 2015 Ting Yao, Tao Mei, Chong-Wah Ngo

One of the fundamental problems in image search is to learn the ranking functions, i.e., similarity between the query and image.

Image Retrieval

Semi-Supervised Domain Adaptation With Subspace Learning for Visual Recognition

no code implementations CVPR 2015 Ting Yao, Yingwei Pan, Chong-Wah Ngo, Houqiang Li, Tao Mei

In many real-world applications, we are often facing the problem of cross domain learning, i.e., to borrow the labeled data or transfer the already learnt knowledge from a source domain to a target domain.

Domain Adaptation Object Recognition +1
