Search Results for author: Bin Zhu

Found 63 papers, 23 papers with code

SwapAnyone: Consistent and Realistic Video Synthesis for Swapping Any Person into Any Video

1 code implementation 12 Mar 2025 ChengShu Zhao, Yunyang Ge, Xinhua Cheng, Bin Zhu, Yatian Pang, Bin Lin, Fan Yang, Feng Gao, Li Yuan

Video body-swapping aims to replace the body in an existing video with a new body from arbitrary sources, which has garnered more attention in recent years.

Video Inpainting

OSCAR: Object Status and Contextual Awareness for Recipes to Support Non-Visual Cooking

no code implementations 7 Mar 2025 Franklin Mingzhe Li, Kaitlyn Ng, Bin Zhu, Patrick Carrington

OSCAR leverages both Large-Language Models (LLMs) and Vision-Language Models (VLMs) to manipulate recipe steps, extract object status information, align visual frames with object status, and provide cooking progress tracking log.

Object

HD-EPIC: A Highly-Detailed Egocentric Video Dataset

no code implementations 6 Feb 2025 Toby Perrett, Ahmad Darkhalil, Saptarshi Sinha, Omar Emara, Sam Pollard, Kranti Parida, Kaiting Liu, Prajwal Gatti, Siddhant Bansal, Kevin Flanagan, Jacob Chalk, Zhifan Zhu, Rhodri Guerrier, Fahd Abdelazim, Bin Zhu, Davide Moltisanti, Michael Wray, Hazel Doughty, Dima Damen

We present a validation dataset of newly-collected kitchen-based egocentric videos, manually annotated with highly detailed and interconnected ground-truth labels covering: recipe steps, fine-grained actions, ingredients with nutritional values, moving objects, and audio annotations.

Action Recognition Nutrition +5

Calling a Spade a Heart: Gaslighting Multimodal Large Language Models via Negation

no code implementations 31 Jan 2025 Bin Zhu, Huiyan Qi, Yinxuan Gui, Jingjing Chen, Chong-Wah Ngo, Ee Peng Lim

Multimodal Large Language Models (MLLMs) have exhibited remarkable advancements in integrating different modalities, excelling in complex understanding and generation tasks.

Negation

CookingDiffusion: Cooking Procedural Image Generation with Stable Diffusion

no code implementations 15 Jan 2025 YuAn Wang, Bin Zhu, Yanbin Hao, Chong-Wah Ngo, Yi Tan, Xiang Wang

These prompts encompass text prompts (representing cooking steps), image prompts (corresponding to cooking images), and multi-modal prompts (mixing cooking steps and images), ensuring the consistent generation of cooking procedural images.

Text-to-Image Generation

Next Patch Prediction for Autoregressive Visual Generation

1 code implementation 19 Dec 2024 Yatian Pang, Peng Jin, Shuo Yang, Bin Lin, Bin Zhu, Zhenyu Tang, Liuhan Chen, Francis E. H. Tay, Ser-Nam Lim, Harry Yang, Li Yuan

Autoregressive models, built based on the Next Token Prediction (NTP) paradigm, show great potential in developing a unified framework that integrates both language and vision tasks.

Image Generation Prediction

Open-Sora Plan: Open-Source Large Video Generation Model

6 code implementations 28 Nov 2024 Bin Lin, Yunyang Ge, Xinhua Cheng, Zongjian Li, Bin Zhu, Shaodong Wang, Xianyi He, Yang Ye, Shenghai Yuan, Liuhan Chen, Tanghui Jia, Junwu Zhang, Zhenyu Tang, Yatian Pang, Bin She, Cen Yan, Zhiheng Hu, Xiaoyi Dong, Lin Chen, Zhang Pan, Xing Zhou, Shaoling Dong, Yonghong Tian, Li Yuan

We introduce Open-Sora Plan, an open-source project that aims to contribute a large generation model for generating desired high-resolution videos with long durations based on various user inputs.

Video Generation

Visual Cue Enhancement and Dual Low-Rank Adaptation for Efficient Visual Instruction Fine-Tuning

no code implementations 19 Nov 2024 Pengkun Jiao, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

Parameter-efficient fine-tuning multimodal large language models (MLLMs) presents significant challenges, including reliance on high-level visual features that limit fine-grained detail comprehension, and data conflicts that arise from task complexity.

parameter-efficient fine-tuning

Retrieval Augmented Recipe Generation

no code implementations 13 Nov 2024 Guoshan Liu, Hailong Yin, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yu-Gang Jiang

Existing works for recipe generation primarily utilize a two-stage training method, first generating ingredients and then obtaining instructions from both the image and ingredients.

Recipe Generation Retrieval

$\ell_0$ factor analysis

no code implementations 13 Nov 2024 Linyang Wang, Wanquan Liu, Bin Zhu

Factor Analysis is about finding a low-rank plus sparse additive decomposition from a noisy estimate of the signal covariance matrix.

IterSelectTune: An Iterative Training Framework for Efficient Instruction-Tuning Data Selection

no code implementations 17 Oct 2024 Jielin Song, Siyu Liu, Bin Zhu, Yanghui Rao

As large language models (LLMs) continue to advance, instruction tuning has become critical for improving their ability to generate accurate and contextually appropriate responses.

When atomic norm meets the G-filter: A general framework for line spectral estimation

no code implementations 16 Oct 2024 Bin Zhu, Jiale Tang

This paper proposes a novel approach for line spectral estimation which combines Georgiou's filter bank (G-filter) with atomic norm minimization (ANM).

Line Spectral Analysis Using the G-Filter: An Atomic Norm Minimization Approach

no code implementations 16 Oct 2024 Bin Zhu

In this paper, we develop a novel approach for line spectral estimation which combines ideas of Georgiou's filter banks (G-filters) and atomic norm minimization (ANM), a mainstream method for line spectral analysis in the last decade following the theory of compressed sensing.

OD-VAE: An Omni-dimensional Video Compressor for Improving Latent Video Diffusion Model

1 code implementation 2 Sep 2024 Liuhan Chen, Zongjian Li, Bin Lin, Bin Zhu, Qian Wang, Shenghai Yuan, Xing Zhou, Xinhua Cheng, Li Yuan

With the same reconstruction quality, the more sufficient the VAE's compression for videos is, the more efficient the LVDMs are.

Video Generation Video Reconstruction

Hand1000: Generating Realistic Hands from Text with Only 1,000 Images

no code implementations 28 Aug 2024 Haozhuo Zhang, Bin Zhu, Yu Cao, Yanbin Hao

The training of Hand1000 is divided into three stages with the first stage aiming to enhance the model's understanding of hand anatomy by using a pre-trained hand gesture recognition model to extract gesture representation.

Anatomy Hand Gesture Recognition +3

RoDE: Linear Rectified Mixture of Diverse Experts for Food Large Multi-Modal Models

no code implementations 17 Jul 2024 Pengkun Jiao, Xinlan Wu, Bin Zhu, Jingjing Chen, Chong-Wah Ngo, Yugang Jiang

Uni-Food is designed to provide a more holistic approach to food data analysis, thereby enhancing the performance and capabilities of LMMs in this domain.

Nutrition

Model Inversion Attacks Through Target-Specific Conditional Diffusion Models

1 code implementation 16 Jul 2024 Ouxiang Li, Yanbin Hao, Zhicai Wang, Bin Zhu, Shuo Wang, Zaixi Zhang, Fuli Feng

To alleviate these issues, leveraging on diffusion models' remarkable synthesis capabilities, we propose Diffusion-based Model Inversion (Diff-MI) attacks.

Image Reconstruction

EfficientGS: Streamlining Gaussian Splatting for Large-Scale High-Resolution Scene Representation

no code implementations 19 Apr 2024 Wenkai Liu, Tao Guan, Bin Zhu, Lili Ju, Zikai Song, Dan Li, Yuesong Wang, Wei Yang

In the domain of 3D scene representation, 3D Gaussian Splatting (3DGS) has emerged as a pivotal technology.

3DGS 4k

AURORA: Navigating UI Tarpits via Automated Neural Screen Understanding

no code implementations 1 Apr 2024 Safwat Ali Khan, Wenyu Wang, Yiran Ren, Bin Zhu, Jiangfan Shi, Alyssa McGowan, Wing Lam, Kevin Moran

We evaluated AURORA both on a set of 12 apps with known tarpits from prior work, and on a new set of five of the most popular apps from the Google Play store.

Navigate

From Canteen Food to Daily Meals: Generalizing Food Recognition to More Practical Scenarios

no code implementations 12 Mar 2024 Guoshan Liu, Yang Jiao, Jingjing Chen, Bin Zhu, Yu-Gang Jiang

These two datasets are used to evaluate the transferability of approaches from the well-curated food image domain to the everyday-life food image domain.

Food Recognition

LLMBind: A Unified Modality-Task Integration Framework

1 code implementation 22 Feb 2024 Bin Zhu, Munan Ning, Peng Jin, Bin Lin, Jinfa Huang, Qi Song, Junwu Zhang, Zhenyu Tang, Mingjun Pan, Xing Zhou, Li Yuan

In the multi-modal domain, the dependence of various models on specific input formats leads to user confusion and hinders progress.

AI Agent Audio Generation +4

Video Editing for Video Retrieval

no code implementations 4 Feb 2024 Bin Zhu, Kevin Flanagan, Adriano Fragomeni, Michael Wray, Dima Damen

The teacher model is employed to edit the clips in the training set whereas the student model trains on the edited clips.

Text Retrieval Video Editing +1

FoodLMM: A Versatile Food Assistant using Large Multi-modal Model

no code implementations 22 Dec 2023 Yuehao Yin, Huiyan Qi, Bin Zhu, Jingjing Chen, Yu-Gang Jiang, Chong-Wah Ngo

In the second stage, we construct a multi-round conversation dataset and a reasoning segmentation dataset to fine-tune the model, enabling it to conduct professional dialogues and generate segmentation masks based on complex reasoning in the food domain.

Food Recognition Multi-Task Learning +4

Benchmarking and Defending Against Indirect Prompt Injection Attacks on Large Language Models

1 code implementation 21 Dec 2023 Jingwei Yi, Yueqi Xie, Bin Zhu, Emre Kiciman, Guangzhong Sun, Xing Xie, Fangzhao Wu

Our analysis identifies two key factors contributing to their success: LLMs' inability to distinguish between informational context and actionable instructions, and their lack of awareness in avoiding the execution of instructions within external content.

Benchmarking

Enhancing Recipe Retrieval with Foundation Models: A Data Augmentation Perspective

1 code implementation 8 Dec 2023 Fangzhou Song, Bin Zhu, Yanbin Hao, Shuo Wang

Leveraging on the remarkable capabilities of foundation models (i.e., Llama2 and SAM), we propose to augment recipe and food image by extracting alignable information related to the counterpart.

Cross-Modal Retrieval Data Augmentation +2

Video-Bench: A Comprehensive Benchmark and Toolkit for Evaluating Video-based Large Language Models

1 code implementation 27 Nov 2023 Munan Ning, Bin Zhu, Yujia Xie, Bin Lin, Jiaxi Cui, Lu Yuan, Dongdong Chen, Li Yuan

Video-based large language models (Video-LLMs) have been recently introduced, targeting both fundamental improvements in perception and comprehension, and a diverse range of user inquiries.

Decision Making Question Answering

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

6 code implementations 16 Nov 2023 Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM.

Language Modeling Language Modelling +5

Controlling Neural Style Transfer with Deep Reinforcement Learning

no code implementations 30 Sep 2023 Chengming Feng, Jing Hu, Xin Wang, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Siwei Lyu

Controlling the degree of stylization in the Neural Style Transfer (NST) is a little tricky since it usually needs hand-engineering on hyper-parameters.

Deep Reinforcement Learning reinforcement-learning +2

RL-I2IT: Image-to-Image Translation with Deep Reinforcement Learning

2 code implementations 24 Sep 2023 Xin Wang, Ziwei Luo, Jing Hu, Chengming Feng, Shu Hu, Bin Zhu, Xi Wu, Hongtu Zhu, Xin Li, Siwei Lyu

The key feature in the RL-I2IT framework is to decompose a monolithic learning process into small steps with a lightweight model to progressively transform a source image successively to a target image.

Auxiliary Learning Decision Making +4

MKL-$L_{0/1}$-SVM

no code implementations 23 Aug 2023 Bin Zhu, Yijie Shi

This paper presents a Multiple Kernel Learning (abbreviated as MKL) framework for the Support Vector Machine (SVM) with the $(0, 1)$ loss function.

CgT-GAN: CLIP-guided Text GAN for Image Captioning

1 code implementation 23 Aug 2023 Jiarui Yu, Haoran Li, Yanbin Hao, Bin Zhu, Tong Xu, Xiangnan He

Particularly, we use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus and CLIP-based reward to provide semantic guidance.

Image Captioning

Towards Attack-tolerant Federated Learning via Critical Parameter Analysis

1 code implementation ICCV 2023 Sungwon Han, Sungwon Park, Fangzhao Wu, Sundong Kim, Bin Zhu, Xing Xie, Meeyoung Cha

Federated learning is used to train a shared model in a decentralized way without clients sharing private data with each other.

Federated Learning

FedDefender: Client-Side Attack-Tolerant Federated Learning

1 code implementation 18 Jul 2023 Sungwon Park, Sungwon Han, Fangzhao Wu, Sundong Kim, Bin Zhu, Xing Xie, Meeyoung Cha

Evaluations of real-world scenarios across multiple datasets show that the proposed method enhances the robustness of federated learning against model poisoning attacks.

Federated Learning Knowledge Distillation +1

Are You Copying My Model? Protecting the Copyright of Large Language Models for EaaS via Backdoor Watermark

1 code implementation 17 May 2023 Wenjun Peng, Jingwei Yi, Fangzhao Wu, Shangxi Wu, Bin Zhu, Lingjuan Lyu, Binxing Jiao, Tong Xu, Guangzhong Sun, Xing Xie

Companies have begun to offer Embedding as a Service (EaaS) based on these LLMs, which can benefit various natural language processing (NLP) tasks for customers.

Model extraction

Harnessing the Power of Text-image Contrastive Models for Automatic Detection of Online Misinformation

no code implementations 19 Apr 2023 Hao Chen, Peng Zheng, Xin Wang, Shu Hu, Bin Zhu, Jinrong Hu, Xi Wu, Siwei Lyu

With the growing use of social media websites in recent decades, the number of news articles spreading online has risen rapidly, resulting in an unprecedented scale of potentially fraudulent information.

Contrastive Learning Misinformation +1

An ADMM Solver for the MKL-$L_{0/1}$-SVM

no code implementations 8 Mar 2023 Yijie Shi, Bin Zhu

We formulate the Multiple Kernel Learning (abbreviated as MKL) problem for the support vector machine with the infamous $(0, 1)$-loss function.

Attacking Important Pixels for Anchor-free Detectors

no code implementations 26 Jan 2023 Yunxu Xie, Shu Hu, Xin Wang, Quanyu Liao, Bin Zhu, Xi Wu, Siwei Lyu

Existing adversarial attacks on object detection focus on attacking anchor-based detectors, which may not work well for anchor-free detectors.

Adversarial Attack object-detection +2

On the Statistical Consistency of a Generalized Cepstral Estimator

no code implementations 17 Jan 2023 Bin Zhu, Mattia Zorzi

We consider the problem to estimate the generalized cepstral coefficients of a stationary stochastic process or stationary multidimensional random field.

Text-driven Video Prediction

no code implementations 6 Oct 2022 Xue Song, Jingjing Chen, Bin Zhu, Yu-Gang Jiang

Specifically, appearance and motion components are provided by the image and caption separately.

Causal Inference Prediction +2

EPIC-KITCHENS VISOR Benchmark: VIdeo Segmentations and Object Relations

3 code implementations 26 Sep 2022 Ahmad Darkhalil, Dandan Shan, Bin Zhu, Jian Ma, Amlan Kar, Richard Higgins, Sanja Fidler, David Fouhey, Dima Damen

VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets.

Object Segmentation +4

Robust Quantity-Aware Aggregation for Federated Learning

no code implementations 22 May 2022 Jingwei Yi, Fangzhao Wu, Huishuai Zhang, Bin Zhu, Tao Qi, Guangzhong Sun, Xing Xie

Federated learning (FL) enables multiple clients to collaboratively train models without sharing their local data, and becomes an important privacy-preserving machine learning framework.

Federated Learning Privacy Preserving

Cross-lingual Adaptation for Recipe Retrieval with Mixup

no code implementations 8 May 2022 Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Wing-Kwong Chan

To bridge the domain gap, recipe mixup loss is proposed to enforce the intermediate domain to locate in the shortest geodesic path between source and target domains in the recipe embedding space.

Retrieval Unsupervised Domain Adaptation

Improving robustness of language models from a geometry-aware perspective

no code implementations Findings (ACL) 2022 Bin Zhu, Zhaoquan Gu, Le Wang, Jinyin Chen, Qi Xuan

On top of FADA, we propose geometry-aware adversarial training (GAT) to perform adversarial training on friendly adversarial data so that we can save a large number of search steps.

Data Augmentation

OneLabeler: A Flexible System for Building Data Labeling Tools

1 code implementation 27 Mar 2022 Yu Zhang, Yun Wang, Haidong Zhang, Bin Zhu, Siming Chen, Dongmei Zhang

In this paper, we propose a conceptual framework for data labeling and OneLabeler based on the conceptual framework to support easy building of labeling tools for diverse usage scenarios.

UA-FedRec: Untargeted Attack on Federated News Recommendation

1 code implementation 14 Feb 2022 Jingwei Yi, Fangzhao Wu, Bin Zhu, Jing Yao, Zhulin Tao, Guangzhong Sun, Xing Xie

Our study reveals a critical security issue in existing federated news recommendation systems and calls for research efforts to address the issue.

Federated Learning News Recommendation +2

TREATED:Towards Universal Defense against Textual Adversarial Attacks

no code implementations 13 Sep 2021 Bin Zhu, Zhaoquan Gu, Le Wang, Zhihong Tian

Recent work shows that deep neural networks are vulnerable to adversarial examples.

Adversarial Defense

Imperceptible Adversarial Examples for Fake Image Detection

no code implementations 3 Jun 2021 Quanyu Liao, Yuezun Li, Xin Wang, Bin Kong, Bin Zhu, Siwei Lyu, Youbing Yin, Qi Song, Xi Wu

Fooling people with highly realistic fake images generated with Deepfake or GANs brings a great social disturbance to our society.

Face Swapping Fake Image Detection

Transferable Adversarial Examples for Anchor Free Object Detection

no code implementations 3 Jun 2021 Quanyu Liao, Xin Wang, Bin Kong, Siwei Lyu, Bin Zhu, Youbing Yin, Qi Song, Xi Wu

Deep neural networks have been demonstrated to be vulnerable to adversarial attacks: subtle perturbation can completely change prediction result.

Adversarial Attack Object +2

Pyramid Fusion Dark Channel Prior for Single Image Dehazing

no code implementations 21 May 2021 Qiyuan Liang, Bin Zhu, Chong-Wah Ngo

In this paper, we propose the pyramid fusion dark channel prior (PF-DCP) for single image dehazing.

Image Dehazing Single Image Dehazing

An Optimized H.266/VVC Software Decoder On Mobile Platform

no code implementations 5 Mar 2021 Yiming Li, Shan Liu, Yu Chen, Yushan Zheng, Sijia Chen, Bin Zhu, Jian Lou

As the successor of H.265/HEVC, the new versatile video coding standard (H.266/VVC) can provide up to 50% bitrate saving with the same subjective quality, at the cost of increased decoding complexity.

4k Decoder

New Strong Bounds on sub-GeV Dark Matter from Boosted and Migdal Effects

no code implementations 17 Dec 2020 Victor V. Flambaum, Liangliang Su, Lei Wu, Bin Zhu

Due to the low nuclear recoils, sub-GeV dark matter (DM) is usually beyond the sensitivity of the conventional DM direct detection experiments.

High Energy Physics - Phenomenology Cosmology and Nongalactic Astrophysics

Line Spectrum Representation for Vector Processes With Application to Frequency Estimation

no code implementations 24 Jun 2020 Bin Zhu

A positive semidefinite Toeplitz matrix, which often arises as the finite covariance matrix of a stationary random process, can be decomposed as the sum of a nonnegative multiple of the identity corresponding to a white noise, and a singular term corresponding to a purely deterministic process.

Time Series Analysis

CookGAN: Causality Based Text-to-Image Synthesis

no code implementations CVPR 2020 Bin Zhu, Chong-Wah Ngo

Particularly, a cooking simulator sub-network is proposed to incrementally make changes to food images based on the interaction between ingredients and cooking methods over a series of steps.

Image Generation

CPM R-CNN: Calibrating Point-guided Misalignment in Object Detection

1 code implementation 7 Mar 2020 Bin Zhu, Qing Song, Lu Yang, Zhihui Wang, Chun Liu, Mengjie Hu

In object detection, offset-guided and point-guided regression dominate anchor-based and anchor-free method separately.

object-detection Object Detection

An Empirical Bayes Approach to Frequency Estimation

no code implementations 21 Oct 2019 Giorgio Picci, Bin Zhu

In this paper we show that the classical problem of frequency estimation can be formulated and solved efficiently in an empirical Bayesian framework by assigning a uniform a priori probability distribution to the unknown frequency.

R2GAN: Cross-Modal Recipe Retrieval With Generative Adversarial Network

no code implementations CVPR 2019 Bin Zhu, Chong-Wah Ngo, Jingjing Chen, Yanbin Hao

Representing procedural text such as recipes for cross-modal retrieval is inherently a difficult problem, not to mention generating images from recipes for visualization.

Generative Adversarial Network Retrieval
