1 code implementation • 12 Mar 2025 • ChengShu Zhao, Yunyang Ge, Xinhua Cheng, Bin Zhu, Yatian Pang, Bin Lin, Fan Yang, Feng Gao, Li Yuan
Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years.
1 code implementation • 10 Mar 2025 • Yuwei Niu, Munan Ning, Mengren Zheng, Bin Lin, Peng Jin, Jiaqi Liao, KunPeng Ning, Bin Zhu, Li Yuan
Text-to-Image (T2I) models are capable of generating high-quality artistic creations and visual content.
no code implementations • 7 Mar 2025 • Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington
OSCAR leverages both Large Language Models (LLMs) and Vision-Language Models (VLMs) to manipulate recipe steps, extract object status information, align visual frames with object status, and provide a cooking progress tracking log.
no code implementations • 6 Feb 2025 • Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen
We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations.
no code implementations • 31 Jan 2025 • Bin Zhu, Huiyan Qi, Yinxuan Gui, Jingjing Chen, Chong-Wah Ngo, Ee Peng Lim
Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks.
no code implementations • 15 Jan 2025 • YuAn Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, Xiang Wang
These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images.
1 code implementation • 19 Dec 2024 • Yatian Pang, Peng Jin, Shuo Yang, Bin Lin, Bin Zhu, Zhenyu Tang, Liuhan Chen, Francis E. H. Tay, Ser-Nam Lim, Harry Yang, Li Yuan
Autoregressive models, built based on the Next Token Prediction (NTP) paradigm, show great potential in developing a unified framework that integrates both language and vision tasks.
6 code implementations • 28 Nov 2024 • Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, Tanghui Jia, Junwu Zhang, Zhenyu Tang, Yatian Pang, Bin She, Cen Yan, Zhiheng Hu, Xiaoyi Dong, Lin Chen, Zhang Pan, Xing Zhou, Shaoling Dong, Yonghong Tian, Li Yuan
We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for generating desired high-resolution videos with long durations based on various user inputs.
no code implementations • 19 Nov 2024 • Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Parameter-efficient fine-tuning of multimodal large language models (MLLMs) presents significant challenges, including reliance on high-level visual features that limit fine-grained detail comprehension, and data conflicts that arise from task complexity.
no code implementations • 13 Nov 2024 • Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang
Existing works for recipe generation primarily utilize a two-stage training method, first generating ingredients and then obtaining instructions from both the image and ingredients.
no code implementations • 13 Nov 2024 • Linyang Wang, Wanquan Liu, Bin Zhu
Factor Analysis is about finding a low-rank plus sparse additive decomposition from a noisy estimate of the signal covariance matrix.
no code implementations • 17 Oct 2024 • Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao
As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses.
no code implementations • 16 Oct 2024 • Bin Zhu, Jiale Tang
This paper proposes a novel approach for line spectral estimation which combines Georgiou's filter bank (G-filter) with atomic norm minimization (ANM).
no code implementations • 16 Oct 2024 • Bin Zhu
In this paper, we develop a novel approach for line spectral estimation which combines ideas of Georgiou's filter banks (G-filters) and atomic norm minimization (ANM), a mainstream method for line spectral analysis in the last decade following the theory of compressed sensing.
1 code implementation • 2 Sep 2024 • Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, Li Yuan
At the same reconstruction quality, the more the VAE compresses videos, the more efficient the LVDMs are.
no code implementations • 28 Aug 2024 • Haozhuo Zhang, Bin Zhu, Yu Cao, Yanbin Hao
The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation.
no code implementations • 17 Jul 2024 • Pengkun Jiao, Xinlan Wu, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yugang Jiang
Uni-Food is designed to provide a more holistic approach to food data analysis, thereby enhancing the performance and capabilities of LMMs in this domain.
1 code implementation • 16 Jul 2024 • Ouxiang Li, Yanbin Hao, Zhicai Wang, Bin Zhu, Shuo Wang, Zaixi Zhang, Fuli Feng
To alleviate these issues, leveraging diffusion models' remarkable synthesis capabilities, we propose Diffusion-based Model Inversion (Diff-MI) attacks.
no code implementations • 19 Apr 2024 • Wenkai Liu, Tao Guan, Bin Zhu, Lili Ju, Zikai Song, Dan Li, Yuesong Wang, Wei Yang
In the domain of 3D scene representation, 3D Gaussian Splatting (3DGS) has emerged as a pivotal technology.
no code implementations • 1 Apr 2024 • Safwat Ali Khan, Wenyu Wang, Yiran Ren, Bin Zhu, Jiangfan Shi, Alyssa McGowan, Wing Lam, Kevin Moran
We evaluated AURORA both on a set of 12 apps with known tarpits from prior work, and on a new set of five of the most popular apps from the Google Play store.
no code implementations • 12 Mar 2024 • Guoshan Liu, Yang Jiao, Jingjing Chen, Bin Zhu, Yu-Gang Jiang
These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain.
1 code implementation • 22 Feb 2024 • Bin Zhu, Munan Ning, Peng Jin, Bin Lin, Jinfa Huang, Qi Song, Junwu Zhang, Zhenyu Tang, Mingjun Pan, Xing Zhou, Li Yuan
In the multi-modal domain, the dependence of various models on specific input formats leads to user confusion and hinders progress.
no code implementations • 4 Feb 2024 • Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen
The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips.
no code implementations • 22 Dec 2023 • Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, Chong-Wah Ngo
In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain.
1 code implementation • 21 Dec 2023 • Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu
Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.
1 code implementation • 8 Dec 2023 • Fangzhou Song, Bin Zhu, Yanbin Hao, Shuo Wang
Leveraging the remarkable capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipes and food images by extracting alignable information related to the counterpart.
1 code implementation • 27 Nov 2023 • Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan
Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.
6 code implementations • 16 Nov 2023 • Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan
In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.
Ranked #9 on Zero-Shot Video Question Answer on TGIF-QA
no code implementations • 7 Oct 2023 • Lei Zhang, Hao Chen, Shu Hu, Bin Zhu, Ching Sheng Lin, Xi Wu, Jinrong Hu, Xin Wang
Generative adversarial networks (GANs) have remarkably advanced in diverse domains, especially image generation and editing.
6 code implementations • 3 Oct 2023 • Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, Hongfa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, Li Yuan
We thus propose VIDAL-10M, which comprises Video, Infrared, Depth, Audio and their corresponding Language.
Ranked #1 on Zero-shot Audio Classification on VGG-Sound (using extra training data)
no code implementations • 30 Sep 2023 • Chengming Feng, Jing Hu, Xin Wang, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Siwei Lyu
Controlling the degree of stylization in Neural Style Transfer (NST) is tricky because it usually requires hand-engineering of hyper-parameters.
no code implementations • 30 Sep 2023 • Shanmin Yang, Hui Guo, Shu Hu, Bin Zhu, Ying Fu, Siwei Lyu, Xi Wu, Xin Wang
Deepfake technology poses a significant threat to security and social trust.
2 code implementations • 24 Sep 2023 • Xin Wang, Ziwei Luo, Jing Hu, Chengming Feng, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Xin Li, Siwei Lyu
The key feature of the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model that progressively transforms a source image into a target image.
no code implementations • 23 Aug 2023 • Bin Zhu, Yijie Shi
This paper presents a Multiple Kernel Learning (abbreviated as MKL) framework for the Support Vector Machine (SVM) with the $(0, 1)$ loss function.
1 code implementation • 23 Aug 2023 • Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu, Xiangnan He
Particularly, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and CLIP-based reward to provide semantic guidance.
1 code implementation • ICCV 2023 • Sungwon Han, Sungwon Park, Fangzhao Wu, Sundong Kim, Bin Zhu, Xing Xie, Meeyoung Cha
Federated learning is used to train a shared model in a decentralized way without clients sharing private data with each other.
1 code implementation • 18 Jul 2023 • Sungwon Park, Sungwon Han, Fangzhao Wu, Sundong Kim, Bin Zhu, Xing Xie, Meeyoung Cha
Evaluations of real-world scenarios across multiple datasets show that the proposed method enhances the robustness of federated learning against model poisoning attacks.
1 code implementation • 17 May 2023 • Wenjun Peng, Jingwei Yi, Fangzhao Wu, Shangxi Wu, Bin Zhu, Lingjuan Lyu, Binxing Jiao, Tong Xu, Guangzhong Sun, Xing Xie
Companies have begun to offer Embedding as a Service (EaaS) based on these LLMs, which can benefit various natural language processing (NLP) tasks for customers.
no code implementations • 19 Apr 2023 • Hao Chen, Peng Zheng, Xin Wang, Shu Hu, Bin Zhu, Jinrong Hu, Xi Wu, Siwei Lyu
With the growing use of social media websites in recent decades, the number of news articles spreading online has increased rapidly, resulting in an unprecedented scale of potentially fraudulent information.
no code implementations • 8 Mar 2023 • Yijie Shi, Bin Zhu
We formulate the Multiple Kernel Learning (abbreviated as MKL) problem for the support vector machine with the infamous $(0, 1)$-loss function.
no code implementations • 26 Jan 2023 • Yunxu Xie, Shu Hu, Xin Wang, Quanyu Liao, Bin Zhu, Xi Wu, Siwei Lyu
Existing adversarial attacks on object detection focus on attacking anchor-based detectors, which may not work well for anchor-free detectors.
no code implementations • 17 Jan 2023 • Bin Zhu, Mattia Zorzi
We consider the problem of estimating the generalized cepstral coefficients of a stationary stochastic process or stationary multidimensional random field.
1 code implementation • 20 Oct 2022 • Yanfei Xiang, Xin Wang, Shu Hu, Bin Zhu, Xiaomeng Huang, Xi Wu, Siwei Lyu
Reinforcement learning is applied to solve actual complex tasks from high-dimensional, sensory inputs.
no code implementations • 6 Oct 2022 • Xue Song, Jingjing Chen, Bin Zhu, Yu-Gang Jiang
Specifically, appearance and motion components are provided by the image and caption, respectively.
3 code implementations • 26 Sep 2022 • Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, Dima Damen
VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets.
no code implementations • 22 May 2022 • Jingwei Yi, Fangzhao Wu, Huishuai Zhang, Bin Zhu, Tao Qi, Guangzhong Sun, Xing Xie
Federated learning (FL) enables multiple clients to collaboratively train models without sharing their local data, and has become an important privacy-preserving machine learning framework.
no code implementations • 8 May 2022 • Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Wing-Kwong Chan
To bridge the domain gap, a recipe mixup loss is proposed to force the intermediate domain to lie on the shortest geodesic path between the source and target domains in the recipe embedding space.
no code implementations • Findings (ACL) 2022 • Bin Zhu, Zhaoquan Gu, Le Wang, Jinyin Chen, Qi Xuan
On top of FADA, we propose geometry-aware adversarial training (GAT) to perform adversarial training on friendly adversarial data so that we can save a large number of search steps.
1 code implementation • 27 Mar 2022 • Yu Zhang, Yun Wang, Haidong Zhang, Bin Zhu, Siming Chen, Dongmei Zhang
In this paper, we propose a conceptual framework for data labeling, and OneLabeler, which builds on that framework to support easy building of labeling tools for diverse usage scenarios.
1 code implementation • 14 Feb 2022 • Jingwei Yi, Fangzhao Wu, Bin Zhu, Jing Yao, Zhulin Tao, Guangzhong Sun, Xing Xie
Our study reveals a critical security issue in existing federated news recommendation systems and calls for research efforts to address the issue.
no code implementations • 13 Sep 2021 • Bin Zhu, Zhaoquan Gu, Le Wang, Zhihong Tian
Recent work shows that deep neural networks are vulnerable to adversarial examples.
no code implementations • 3 Jun 2021 • Quanyu Liao, Yuezun Li, Xin Wang, Bin Kong, Bin Zhu, Siwei Lyu, Youbing Yin, Qi Song, Xi Wu
Fooling people with highly realistic fake images generated with Deepfake or GANs causes great social disturbance.
no code implementations • 3 Jun 2021 • Quanyu Liao, Xin Wang, Bin Kong, Siwei Lyu, Bin Zhu, Youbing Yin, Qi Song, Xi Wu
Deep neural networks have been demonstrated to be vulnerable to adversarial attacks: subtle perturbations can completely change the prediction result.
no code implementations • 21 May 2021 • Qiyuan Liang, Bin Zhu, Chong-Wah Ngo
In this paper, we propose the pyramid fusion dark channel prior (PF-DCP) for single image dehazing.
no code implementations • 5 Mar 2021 • Yiming Li, Shan Liu, Yu Chen, Yushan Zheng, Sijia Chen, Bin Zhu, Jian Lou
As the successor of H.265/HEVC, the new versatile video coding standard (H.266/VVC) can provide up to 50% bitrate saving with the same subjective quality, at the cost of increased decoding complexity.
no code implementations • 17 Dec 2020 • Victor V. Flambaum, Liangliang Su, Lei Wu, Bin Zhu
Due to the low nuclear recoils, sub-GeV dark matter (DM) is usually beyond the sensitivity of the conventional DM direct detection experiments.
High Energy Physics - Phenomenology • Cosmology and Nongalactic Astrophysics
no code implementations • 24 Jun 2020 • Bin Zhu
A positive semidefinite Toeplitz matrix, which often arises as the finite covariance matrix of a stationary random process, can be decomposed as the sum of a nonnegative multiple of the identity corresponding to a white noise, and a singular term corresponding to a purely deterministic process.
no code implementations • CVPR 2020 • Bin Zhu, Chong-Wah Ngo
Particularly, a cooking simulator sub-network is proposed to incrementally make changes to food images based on the interaction between ingredients and cooking methods over a series of steps.
1 code implementation • 7 Mar 2020 • Bin Zhu, Qing Song, Lu Yang, Zhihui Wang, Chun Liu, Mengjie Hu
In object detection, offset-guided and point-guided regression dominate anchor-based and anchor-free methods, respectively.
no code implementations • 21 Oct 2019 • Giorgio Picci, Bin Zhu
In this paper we show that the classical problem of frequency estimation can be formulated and solved efficiently in an empirical Bayesian framework by assigning a uniform a priori probability distribution to the unknown frequency.
no code implementations • 2 Oct 2019 • Bin Zhu, Xin Guo, Kenneth Barner, Charles Boncelet
The task is to predict the cohesive level for a group of people in images.
1 code implementation • 19 Sep 2019 • Xin Guo, Luisa F. Polania, Bin Zhu, Charles Boncelet, Kenneth E. Barner
A graph neural network (GNN) for image understanding based on multiple cues is proposed in this paper.
no code implementations • CVPR 2019 • Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Yanbin Hao
Representing procedural text such as recipes for cross-modal retrieval is inherently a difficult problem, not to mention generating images from recipes for visualization.