Search Results for author: Hao Tang

Found 308 papers, 138 papers with code

ScoreAdv: Score-based Targeted Generation of Natural Adversarial Examples via Diffusion Models

1 code implementation 8 Jul 2025 Chihan Huang, Hao Tang

In this paper, we introduce a novel approach for generating unrestricted adversarial examples (UAEs) based on diffusion models, named ScoreAdv.

Adversarial Attack Denoising

Fine-grained Image Retrieval via Dual-Vision Adaptation

no code implementations 19 Jun 2025 Xin Jiang, Meiqi Cao, Hao Tang, Fei Shen, Zechao Li

Fine-Grained Image Retrieval (FGIR) faces challenges in learning discriminative visual representations to retrieve images with similar fine-grained features.

Image Retrieval Knowledge Distillation +1

Learning to Reason Across Parallel Samples for LLM Reasoning

no code implementations 10 Jun 2025 Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, Eunsol Choi

By separating the LLMs that generate answers from the LLMs that analyze and aggregate the sampled answers, our approach can work with the outputs of premier black-box models easily and efficiently.

Math Re-Ranking

Learning Compact Vision Tokens for Efficient Large Multimodal Models

1 code implementation 8 Jun 2025 Hao Tang, Chengchao Shen

Specifically, we propose a Spatial Token Fusion (STF) method to learn compact vision tokens for a short vision token sequence, where spatially adjacent tokens are fused into one.

Multimodal Reasoning Token Reduction

Token Transforming: A Unified and Training-Free Token Compression Framework for Vision Transformer Acceleration

no code implementations 6 Jun 2025 Fanhu Zeng, Deli Yu, Zhenglun Kong, Hao Tang

In this paper, we rethink token reduction and unify the process as an explicit form of token matrix transformation, in which all existing methods are constructing special forms of matrices within the framework.

Depth Estimation object-detection +2

Enhancing Diffusion-based Unrestricted Adversarial Attacks via Adversary Preferences Alignment

no code implementations 2 Jun 2025 Kaixun Jiang, Zhaoyu Chen, Haijing Guo, Jinglun Li, Jiyuan Fu, Pinxue Guo, Hao Tang, Bo Li, Wenqiang Zhang

Unlike benign alignment, adversarial alignment involves two inherently conflicting preferences: visual consistency and attack effectiveness, which often lead to unstable optimization and reward hacking (e.g., reducing visual quality to improve attack success).

FOLIAGE: Towards Physical Intelligence World Models Via Unbounded Surface Evolution

no code implementations 29 May 2025 Xiaoyi Liu, Hao Tang

We propose FOLIAGE, a physics-informed multimodal world model for unbounded accretive surface growth.

counterfactual Cross-Modal Retrieval

Enabling Flexible Multi-LLM Integration for Scalable Knowledge Aggregation

1 code implementation 28 May 2025 Zhenglun Kong, Zheng Zhan, Shiyue Hou, Yifan Gong, Xin Meng, Pengwei Sui, Peiyan Dong, Xuan Shen, Zifeng Wang, Pu Zhao, Hao Tang, Stratis Ioannidis, Yanzhi Wang

To address these issues, we propose a framework that adaptively selects and aggregates knowledge from diverse LLMs to build a single, stronger model, avoiding the high memory overhead of ensemble and inflexible weight merging.

Effective Context in Neural Speech Models

no code implementations 28 May 2025 Yen Meng, Sharon Goldwater, Hao Tang

Modern neural speech models benefit from having longer context, and many approaches have been proposed to increase the maximum context a model can use.

SpikeStereoNet: A Brain-Inspired Framework for Stereo Depth Estimation from Spike Streams

no code implementations 26 May 2025 Zhuoheng Gao, Yihao Li, Jiyao Zhang, Rui Zhao, Tong Wu, Hao Tang, Zhaofei Yu, Hao Dong, Guozhang Chen, Tiejun Huang

To address this gap, we propose SpikeStereoNet, a brain-inspired framework and the first to estimate stereo depth directly from raw spike streams.

Stereo Depth Estimation

Token Reduction Should Go Beyond Efficiency in Generative Models -- From Vision, Language to Multimodality

1 code implementation 23 May 2025 Zhenglun Kong, Yize Li, Fanhu Zeng, Lei Xin, Shvat Messica, Xue Lin, Pu Zhao, Manolis Kellis, Hao Tang, Marinka Zitnik

We highlight its potential to drive new model architectures and learning strategies that improve robustness, increase interpretability, and better align with the objectives of generative modeling.

In-Context Learning Token Reduction

SAMba-UNet: Synergizing SAM2 and Mamba in UNet with Heterogeneous Aggregation for Cardiac MRI Segmentation

no code implementations 22 May 2025 Guohao Huo, Ruiting Dai, Hao Tang

To address the challenge of complex pathological feature extraction in automated cardiac MRI segmentation, this study proposes an innovative dual-encoder architecture named SAMba-UNet.

Mamba MRI segmentation

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

no code implementations 22 May 2025 Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Dahua Lin, Matt Feiszli, Kevin J. Liang

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for robotics and other real-world applications that require multi-frame reasoning.

Programmatic Video Prediction Using Large Language Models

1 code implementation 20 May 2025 Hao Tang, Kevin Ellis, Suhas Lohit, Michael J. Jones, Moitreya Chatterjee

The task of estimating the world model describing the dynamics of a real world process assumes immense importance for anticipating and preparing for future outcomes.

Autonomous Driving Prediction +2

Replace in Translation: Boost Concept Alignment in Counterfactual Text-to-Image

no code implementations 20 May 2025 Sifan Li, Ming Tao, Hao Zhao, Ling Shao, Hao Tang

For scenes that defy physics and cannot happen in the real world, we should spare no effort to increase both the factual feel, meaning the synthesized images look like something people would consider likely to be happening, and concept alignment, meaning all the required objects appear in the same frame.

Concept Alignment counterfactual

CtrlDiff: Boosting Large Diffusion Language Models with Dynamic Block Prediction and Controllable Generation

no code implementations 20 May 2025 Chihan Huang, Hao Tang

Although autoregressive models have dominated language modeling in recent years, there has been a growing interest in exploring alternative paradigms to the conventional next-token prediction framework.

Conditional Text Generation Language Modeling +1

Structured Agent Distillation for Large Language Model

no code implementations 20 May 2025 Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Tianqi Li, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Pu Zhao, Xue Lin, Dong Huang, Yanzhi Wang

Large language models (LLMs) exhibit strong capabilities as decision-making agents by interleaving reasoning and actions, as seen in ReAct-style frameworks.

Imitation Learning Language Modeling +3

Improved Algorithms for Differentially Private Language Model Alignment

no code implementations 13 May 2025 Keyu Chen, Hao Tang, Qinglin Liu, Yizhao Xu

Language model alignment is crucial for ensuring that large language models (LLMs) align with human preferences, yet it often involves sensitive user data, raising significant privacy concerns.

Language Modeling Language Modelling +2

Semantic-Guided Diffusion Model for Single-Step Image Super-Resolution

1 code implementation 11 May 2025 Zihang Liu, Zhenyu Zhang, Hao Tang

To address this limitation, we propose SAMSR, a semantic-guided diffusion framework that incorporates semantic segmentation masks into the sampling process.

Image Super-Resolution Semantic Segmentation

Multimodal Large Language Models for Medicine: A Comprehensive Survey

no code implementations 29 Apr 2025 Jiarui Ye, Hao Tang

At the end of the survey, we discuss the challenges faced by MLLMs in the medical and healthcare domain and propose feasible methods to mitigate or overcome these issues.

Medical Diagnosis Survey

EventVAD: Training-Free Event-Aware Video Anomaly Detection

no code implementations 17 Apr 2025 Yihua Shao, Haojin He, Sijie Li, Siyu Chen, Xinwei Long, Fanhu Zeng, Yuxuan Fan, Muyang Zhang, Ziyang Yan, Ao Ma, Xiaochen Wang, Hao Tang, Yan Wang, Shuyan Li

Therefore, we propose EventVAD, an event-aware video anomaly detection framework that combines tailored dynamic graph architectures and multimodal LLMs through temporal-event reasoning.

Anomaly Detection Boundary Detection +2

3D CoCa: Contrastive Learners are 3D Captioners

1 code implementation 13 Apr 2025 Ting Huang, Zeyu Zhang, Yemin Wang, Hao Tang

3D captioning, which aims to describe the content of 3D scenes in natural language, remains highly challenging due to the inherent sparsity of point clouds and weak cross-modal alignment in existing methods.

3D dense captioning Caption Generation +4

Multi-scale Activation, Refinement, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition

no code implementations AAAI 2025 Zhicheng Zhang, Hao Tang, Jinhui Tang

Specifically, we first propose a multi-scale cue activation module to ensure that the discriminative cues learned at different stages are mutually different.

Follow Your Motion: A Generic Temporal Consistency Portrait Editing Framework with Trajectory Guidance

no code implementations 28 Mar 2025 Haijie Yang, Zhenyu Zhang, Hao Tang, Jianjun Qian, Jian Yang

However, they often face challenges with temporal consistency, particularly in the talking head domain, where continuous changes in facial expressions intensify the level of difficulty.

PartRM: Modeling Part-Level Dynamics with Large Cross-State Reconstruction Model

no code implementations CVPR 2025 Mingju Gao, Yike Pan, Huan-ang Gao, Zongzheng Zhang, Wenyi Li, Hao Dong, Hao Tang, Li Yi, Hao Zhao

As interest grows in world models that predict future states from current observations and actions, accurately modeling part-level dynamics has become increasingly relevant for various applications.

4D reconstruction

Beyond Semantics: Rediscovering Spatial Awareness in Vision-Language Models

no code implementations 21 Mar 2025 Jianing Qi, Jiawei Liu, Hao Tang, Zhigang Zhu

Vision-Language Models (VLMs) excel at identifying and describing objects but struggle with spatial reasoning such as accurately understanding the relative positions of objects.

Diagnostic Object Recognition +1

MambaIC: State Space Models for High-Performance Learned Image Compression

1 code implementation CVPR 2025 Fanhu Zeng, Hao Tang, Yihua Shao, Siyu Chen, Ling Shao, Yan Wang

Inspired by the effectiveness of state space models (SSMs) in capturing long-range dependencies, we leverage SSMs to address computational inefficiency in existing methods and improve image compression from multiple perspectives.

Image Compression State Space Models

Dynamic Scene Reconstruction: Recent Advance in Real-time Rendering and Streaming

no code implementations 11 Mar 2025 Jiaxuan Zhu, Hao Tang

Representing and rendering dynamic scenes from 2D images is a fundamental yet challenging problem in computer vision and graphics.

OT-DETECTOR: Delving into Optimal Transport for Zero-shot Out-of-Distribution Detection

no code implementations 9 Mar 2025 Yu Liu, Hao Tang, Haiqi Zhang, Jing Qin, Zechao Li

Out-of-distribution (OOD) detection is crucial for ensuring the reliability and safety of machine learning models in real-world applications.

Out-of-Distribution Detection Out of Distribution (OOD) Detection

TR-DQ: Time-Rotation Diffusion Quantization

no code implementations 9 Mar 2025 Yihua Shao, Deyang Lin, Fanhu Zeng, Minxi Yan, Muyang Zhang, Siyu Chen, Yuxuan Fan, Ziyang Yan, Haozhe Wang, Jingcai Guo, Yan Wang, Haotong Qin, Hao Tang

TR-DQ achieves state-of-the-art (SOTA) performance on image generation and video generation tasks, with a 1.38-1.89x speedup and a 1.97-2.58x memory reduction in inference compared to existing quantization methods.

Image Generation Quantization +1

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

1 code implementation 3 Mar 2025 Hao Tang, ChenWei Xie, Haiyang Wang, Xiaoyi Bao, Tingyu Weng, Pandeng Li, Yun Zheng, LiWei Wang

Generalist models have achieved remarkable success in both language and vision-language tasks, showcasing the potential of unified modeling.

Instance Segmentation Reasoning Segmentation +2

Improved YOLOv7x-Based Defect Detection Algorithm for Power Equipment

no code implementations 25 Feb 2025 Jin Hou, Hao Tang

The normal operation of power equipment plays a critical role in the power system, making anomaly detection for power equipment highly significant.

Anomaly Detection Defect Detection

Parameter Efficient Merging for Multimodal Large Language Models with Complementary Parameter Adaptation

no code implementations 24 Feb 2025 Fanhu Zeng, Haiyang Guo, Fei Zhu, Li Shen, Hao Tang

With the expansion in data and model size, parameter efficient tuning becomes the common practice for obtaining task-specific models efficiently.

Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization

no code implementations 18 Feb 2025 Tzu-Quan Lin, Wei-Ping Huang, Hao Tang, Hung-Yi Lee

Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model, and thus could possibly lose cross-task generalization.

Automatic Speech Recognition Speaker Identification +2

FE-UNet: Frequency Domain Enhanced U-Net with Segment Anything Capability for Versatile Image Segmentation

no code implementations 6 Feb 2025 Guohao Huo, Ruiting Dai, Ling Shao, Hao Tang

To further emulate the human visual system, we introduce the Frequency Domain Enhanced Receptive Field Block (FE-RFB), which integrates WSPM to extract enriched features from the frequency domain.

Image Segmentation Segmentation +1

RFMedSAM 2: Automatic Prompt Refinement for Enhanced Volumetric Medical Image Segmentation with SAM 2

no code implementations 4 Feb 2025 Bin Xie, Hao Tang, Yan Yan, Gady Agam

Segment Anything Model 2 (SAM 2), a prompt-driven foundation model extending SAM to both image and video domains, has shown superior zero-shot performance compared to its predecessor.

Image Segmentation Semantic Segmentation +1

In-Context Meta LoRA Generation

no code implementations 29 Jan 2025 Yihua Shao, Minxi Yan, Yang Liu, Siyu Chen, Wenjie Chen, Xinwei Long, Ziyang Yan, Lei LI, Chenyu Zhang, Nicu Sebe, Hao Tang, Yan Wang, Hao Zhao, Mengzhu Wang, Jingcai Guo

As a result, our method achieves more accurate LoRA parameter generation for diverse tasks using CVAE.

Meta-Learning

A Training-free Synthetic Data Selection Method for Semantic Segmentation

1 code implementation 25 Jan 2025 Hao Tang, Siyue Yu, Jian Pang, Bingfeng Zhang

Then we propose a class-balance Annotation Similarity Filter (ASF) by comparing the synthetic annotation with the response of CLIP to remove the samples related to low-quality annotations.

Semantic Segmentation

UDiTQC: U-Net-Style Diffusion Transformer for Quantum Circuit Synthesis

no code implementations 24 Jan 2025 Zhiwei Chen, Hao Tang

Quantum computing is a transformative technology with wide-ranging applications, and efficient quantum circuit generation is crucial for unlocking its full potential.

Computational Efficiency Quantum Circuit Generation

Fast3R: Towards 3D Reconstruction of 1000+ Images in One Forward Pass

1 code implementation CVPR 2025 Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, Matt Feiszli

Multi-view 3D reconstruction remains a core challenge in computer vision, particularly in applications requiring accurate and scalable representations across diverse perspectives.

3D Reconstruction Camera Pose Estimation +2

Enhanced Multi-Scale Cross-Attention for Person Image Generation

no code implementations 15 Jan 2025 Hao Tang, Ling Shao, Nicu Sebe, Luc van Gool

Moreover, we propose two novel cross-attention blocks to effectively transfer and update the person's shape and appearance embeddings for mutual improvement.

Generative Adversarial Network Image Generation

RoRA: Efficient Fine-Tuning of LLM with Reliability Optimization for Rank Adaptation

no code implementations 8 Jan 2025 Jun Liu, Zhenglun Kong, Peiyan Dong, Changdi Yang, Xuan Shen, Pu Zhao, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang, Yanzhi Wang

Although Low-Rank Adaptation (LoRA) is widely used and effective for fine-tuning, we have observed that its scaling factor can limit or even reduce performance as the rank size increases.

End-to-End Long Document Summarization using Gradient Caching

no code implementations 3 Jan 2025 Rohit Saxena, Hao Tang, Frank Keller

Training transformer-based encoder-decoder models for long document summarization poses a significant challenge due to the quadratic memory consumption during training.

Decoder Document Summarization +1

Boosting Adversarial Transferability with Spatial Adversarial Alignment

no code implementations 2 Jan 2025 Zhaoyu Chen, Haijing Guo, Kaixun Jiang, Jiyuan Fu, Xinyu Zhou, Dingkang Yang, Hao Tang, Bo Li, Wenqiang Zhang

To achieve high transferability, we propose a technique termed Spatial Adversarial Alignment (SAA), which employs an alignment loss and leverages a witness model to fine-tune the surrogate model.

Data Augmentation

HOIGPT: Learning Long-Sequence Hand-Object Interaction with Language Models

no code implementations CVPR 2025 Mingzhen Huang, Fu-Jen Chu, Bugra Tekin, Kevin J. Liang, Haoyu Ma, Weiyao Wang, Xingyu Chen, Pierre Gleize, Hongfei Xue, Siwei Lyu, Kris Kitani, Matt Feiszli, Hao Tang

We introduce HOIGPT, a token-based generative method that unifies 3D hand-object interaction (HOI) perception and generation, offering the first comprehensive solution for captioning and generating high-quality 3D HOI sequences from a diverse range of conditional signals (e.g., text, objects, partial sequences).

Language Modeling Language Modelling +3

Artificial Intelligence for Central Dogma-Centric Multi-Omics: Challenges and Breakthroughs

no code implementations 17 Dec 2024 Lei Xin, Caiyun Huang, Hao Li, Shihong Huang, Yuling Feng, Zhenglun Kong, Zicheng Liu, Siyuan Li, Chang Yu, Fei Shen, Hao Tang

With the rapid development of high-throughput sequencing platforms, an increasing number of omics technologies, such as genomics, metabolomics, and transcriptomics, are being applied to disease genetics research.

Articles Disease Prediction

Divide-and-Conquer: Confluent Triple-Flow Network for RGB-T Salient Object Detection

no code implementations 2 Dec 2024 Hao Tang, Zechao Li, Dong Zhang, Shengfeng He, Jinhui Tang

Furthermore, a Modality-aware Dynamic Aggregation Module in the modality-complementary flow dynamically aggregates saliency-related cues from both modality-specific flows.

object-detection Object Detection +1

Network Inversion and Its Applications

no code implementations 26 Nov 2024 Pirzada Suhail, Hao Tang, Amit Sethi

Neural networks have emerged as powerful tools across various applications, yet their decision-making process often remains opaque, leading to them being perceived as "black boxes."

Decision Making Diversity +1

Multimodal Alignment and Fusion: A Survey

no code implementations 26 Nov 2024 Songtao Li, Hao Tang

This survey offers a comprehensive review of recent advancements in multimodal alignment and fusion within machine learning, spurred by the growing diversity of data types such as text, images, audio, and video.

Data Integration Diversity +3

Text-to-Image Synthesis: A Decade Survey

no code implementations 25 Nov 2024 Nonghai Zhang, Hao Tang

When humans read a specific text, they often visualize the corresponding images, and we hope that computers can do the same.

Diversity Image Generation +1

Hierarchical Cross-Attention Network for Virtual Try-On

no code implementations 23 Nov 2024 Hao Tang, Bin Ren, Pingping Wu, Nicu Sebe

In this paper, we present an innovative solution for the challenges of the virtual try-on task: our novel Hierarchical Cross-Attention Network (HCANet).

Geometric Matching Virtual Try-on

AllRestorer: All-in-One Transformer for Image Restoration under Composite Degradations

no code implementations 16 Nov 2024 Jiawei Mao, Yu Yang, Xuesong Yin, Ling Shao, Hao Tang

Specifically, we introduce an All-in-One Transformer Block (AiOTB), which adaptively removes all degradations present in a given image by modeling the relationships between all degradations and the image embedding in latent space.

All Image Restoration

DiffFNO: Diffusion Fourier Neural Operator

no code implementations CVPR 2025 Xiaoyi Liu, Hao Tang

We introduce DiffFNO, a novel diffusion framework for arbitrary-scale super-resolution strengthened by a Weighted Fourier Neural Operator (WFNO).

Computational Efficiency Super-Resolution

KMM: Key Frame Mask Mamba for Extended Motion Generation

1 code implementation 10 Nov 2024 Zeyu Zhang, Hang Gao, Akide Liu, Qi Chen, Feng Chen, Yiran Wang, Danning Li, Rui Zhao, ZhenMing Li, Zhongwen Zhou, Hao Tang, Bohan Zhuang

The recent Mamba architecture shows promising results in efficiently modeling long and complex sequences, yet two significant challenges remain: Firstly, directly applying Mamba to extended motion generation is ineffective, as the limited capacity of the implicit memory leads to memory decay.

Contrastive Learning Mamba +1

Layer-Wise Feature Metric of Semantic-Pixel Matching for Few-Shot Learning

1 code implementation 10 Nov 2024 Hao Tang, Junhao Lu, Guoheng Huang, Ming Li, Xuhang Chen, Guo Zhong, Zhengguang Tan, Zinuo Li

In Few-Shot Learning (FSL), traditional metric-based approaches often rely on global metrics to compute similarity.

Few-Shot Learning

Combining Induction and Transduction for Abstract Reasoning

1 code implementation 4 Nov 2024 Wen-Ding Li, Keya Hu, Carter Larsen, Yuqing Wu, Simon Alford, Caleb Woo, Spencer M. Dunn, Hao Tang, Michelangelo Naim, Dat Nguyen, Wei-Long Zheng, Zenna Tavares, Yewen Pu, Kevin Ellis

When learning an input-output mapping from very few examples, is it better to first infer a latent function that explains the examples, or is it better to directly predict new test outputs, e.g., using a neural network?

ARC Program Synthesis

VisualPredicator: Learning Abstract World Models with Neuro-Symbolic Predicates for Robot Planning

no code implementations 30 Oct 2024 Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B. Tenenbaum, Tom Silver, João F. Henriques, Kevin Ellis

Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space.

Hierarchical Reinforcement Learning Language Modeling +2

Cavia: Camera-controllable Multi-view Video Diffusion with View-Integrated Attention

no code implementations 14 Oct 2024 Dejia Xu, Yifan Jiang, Chen Huang, Liangchen Song, Thorsten Gernoth, Liangliang Cao, Zhangyang Wang, Hao Tang

Recent studies have attempted to incorporate camera control into the generation process, but their results are often limited to simple trajectories or lack the ability to generate consistent videos from multiple distinct camera paths for the same scene.

Image to Video Generation

VerifierQ: Enhancing LLM Test Time Compute with Q-Learning-based Verifiers

no code implementations 10 Oct 2024 Jianing Qi, Hao Tang, Zhigang Zhu

Recent advancements in test time compute, particularly through the use of verifier models, have significantly enhanced the reasoning capabilities of Large Language Models (LLMs).

Mathematical Reasoning Q-Learning +1

OmniPose6D: Towards Short-Term Object Pose Tracking in Dynamic Scenes from Monocular RGB

no code implementations 9 Oct 2024 Yunzhi Lin, Yipu Zhao, Fu-Jen Chu, Xingyu Chen, Weiyao Wang, Hao Tang, Patricio A. Vela, Matt Feiszli, Kevin Liang

To address the challenge of short-term object pose tracking in dynamic environments with monocular RGB input, we introduce a large-scale synthetic dataset OmniPose6D, crafted to mirror the diversity of real-world conditions.

Benchmarking Diversity +2

Toward Zero-Shot Learning for Visual Dehazing of Urological Surgical Robots

1 code implementation 2 Oct 2024 Renkai Wu, Xianjin Wang, Pengchen Liang, Zhenyu Zhang, Qing Chang, Hao Tang

In addition, we organize and propose a dehaze dataset for robotic vision in urological surgery (USRobot-Dehaze dataset).

Zero-Shot Learning

A Simple HMM with Self-Supervised Representations for Phone Segmentation

no code implementations 15 Sep 2024 Gene-Ping Yang, Hao Tang

Despite the recent advance in self-supervised representations, unsupervised phonetic segmentation remains challenging.

Segmentation Self-Supervised Learning

Estimating the Completeness of Discrete Speech Units

no code implementations 9 Sep 2024 Sung-Lin Yeh, Hao Tang

We find that speaker information is sufficiently present in HuBERT discrete units, and that phonetic information is sufficiently present in the residual, showing that vector quantization does not achieve disentanglement.

Disentanglement Quantization

Property Neurons in Self-Supervised Speech Transformers

1 code implementation 7 Sep 2024 Tzu-Quan Lin, Guan-Ting Lin, Hung-Yi Lee, Hao Tang

It is, however, desirable to have an approach that can pinpoint exactly a subset of neurons that is responsible for a particular property of speech, being amenable to model pruning and model editing.

Model Editing

Data-Free Class Incremental Gesture Recognition via Synthetic Feature Sampling

no code implementations 21 Aug 2024 Zhenyu Lu, Hao Tang

Data-Free Class Incremental Learning (DFCIL) aims to enable models to continuously learn new classes while retaining knowledge of old classes, even when the training data for old classes is unavailable.

class-incremental learning Class Incremental Learning +2

Barbie: Text to Barbie-Style 3D Avatars

1 code implementation 17 Aug 2024 Xiaokun Sun, Zhenyu Zhang, Ying Tai, Qian Wang, Hao Tang, Zili Yi, Jian Yang

In this paper, we propose Barbie, a novel framework for generating 3D avatars that can be dressed in diverse and high-quality Barbie-like garments and accessories.

Disentanglement Diversity

ADen: Adaptive Density Representations for Sparse-view Camera Pose Estimation

no code implementations 16 Aug 2024 Hao Tang, Weiyao Wang, Pierre Gleize, Matt Feiszli

Recent data-driven approaches aim to directly output camera poses, either through regressing the 6DoF camera poses or formulating rotation as a probability distribution.

Camera Pose Estimation Pose Estimation

Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers

no code implementations 25 Jul 2024 Zhengang Li, Alec Lu, Yanyue Xie, Zhenglun Kong, Mengshu Sun, Hao Tang, Zhong Jia Xue, Peiyan Dong, Caiwen Ding, Yanzhi Wang, Xue Lin, Zhenman Fang

This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs, to design efficient ViT models for hardware implementation while preserving the accuracy.

Quantization

Stable-Hair: Real-World Hair Transfer via Diffusion Model

1 code implementation 19 Jul 2024 Yuxuan Zhang, Qing Zhang, Yiren Song, Jichao Zhang, Hao Tang, Jiaming Liu

In the second stage, we specifically designed a Hair Extractor and a Latent IdentityNet to transfer the target hairstyle to the bald image with high detail and fidelity.

Triplet

InfiniMotion: Mamba Boosts Memory in Transformer for Arbitrary Long Motion Generation

1 code implementation 14 Jul 2024 Zeyu Zhang, Akide Liu, Qi Chen, Feng Chen, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang

Text-to-motion generation holds potential for film, gaming, and robotics, yet current methods often prioritize short motion generation, making it challenging to produce long motion sequences effectively: (1) Current methods struggle to handle long motion sequences as a single input due to prohibitively high computational cost; (2) Breaking down the generation of long motion sequences into shorter segments can result in inconsistent transitions and requires interpolation or inpainting, which lacks entire sequence modeling.

Mamba Motion Generation

3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

1 code implementation 13 Jul 2024 Xiaoxu Xu, Yitian Yuan, Jinlong Li, Qiudan Zhang, Zequn Jie, Lin Ma, Hao Tang, Nicu Sebe, Xu Wang

In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance, an alternative approach in which a 3D model predicts a dense embedding for each point that is co-embedded with both the aligned image and text spaces of the 2D vision-language model.

3D Semantic Segmentation Language Modelling +3

3x2: 3D Object Part Segmentation by 2D Semantic Correspondences

no code implementations 12 Jul 2024 Anh Thai, Weiyao Wang, Hao Tang, Stefan Stojanov, Matt Feiszli, James M. Rehg

While substantial progress has been made in 2D object part segmentation, the 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect.

Object Segmentation

GMC: A General Framework of Multi-stage Context Learning and Utilization for Visual Detection Tasks

no code implementations 8 Jul 2024 Xuan Wang, Hao Tang, Zhigang Zhu

In this paper, GMC, a general framework, is proposed for multi-stage context learning and utilization, with various deep network architectures for various visual detection tasks.

In-Context Learning object-detection +2

ARNet: Self-Supervised FG-SBIR with Unified Sample Feature Alignment and Multi-Scale Token Recycling

2 code implementations 17 Jun 2024 Jianan Jiang, Hao Tang, Zhilin Jiang, Weiren Yu, Di wu

Fine-Grained Sketch-Based Image Retrieval (FG-SBIR) aims to minimize the distance between sketches and corresponding images in the embedding space.

Retrieval Sketch-Based Image Retrieval

Orthogonality and isotropy of speaker and phonetic information in self-supervised speech representations

no code implementations 13 Jun 2024 Mukhtar Mohamed, Oli Danyi Liu, Hao Tang, Sharon Goldwater

Self-supervised speech representations can hugely benefit downstream speech technologies, yet the properties that make them useful are still poorly understood.

DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models

1 code implementation 8 Jun 2024 Tzu-Quan Lin, Hung-Yi Lee, Hao Tang

We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss, eliminating the need for multiple rounds of training and fine-tuning.

From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks

1 code implementation 4 Jun 2024 Xiaofeng Zhang, Yihao Quan, Chen Shen, Xiaosong Yuan, Shaotian Yan, Liang Xie, Wenxiao Wang, Chaochen Gu, Hao Tang, Jieping Ye

Large Vision Language Models (LVLMs) achieve great performance on visual-language reasoning tasks; however, the black-box nature of LVLMs hinders in-depth research on the reasoning mechanism.

Image Captioning Language Modelling +3

Dataset Growth

1 code implementation 28 May 2024 Ziheng Qin, Zhaopan Xu, Yukun Zhou, Zangwei Zheng, Zebang Cheng, Hao Tang, Lei Shang, Baigui Sun, Xiaojiang Peng, Radu Timofte, Hongxun Yao, Kai Wang, Yang You

To tackle this challenge, we propose InfoGrowth, an efficient online algorithm for data cleaning and selection, resulting in a growing dataset that keeps up to date with awareness of cleanliness and diversity.

Diversity

Code Repair with LLMs gives an Exploration-Exploitation Tradeoff

no code implementations 26 May 2024 Hao Tang, Keya Hu, Jin Peng Zhou, Sicheng Zhong, Wei-Long Zheng, Xujie Si, Kevin Ellis

Iteratively improving and repairing source code with large language models (LLMs), known as refinement, has emerged as a popular way of generating programs that would be too complex to construct in one shot.

Code Repair Language Modeling +4

Multi-task learning for molecular electronic structure approaching coupled-cluster accuracy

1 code implementation9 May 2024 Hao Tang, Brian Xiao, Wenhao He, Pero Subasic, Avetik R. Harutyunyan, Yao Wang, Fang Liu, Haowei Xu, Ju Li

Machine learning (ML) plays an important role in quantum chemistry, providing fast-to-evaluate predictive models for various properties of molecules.

Multi-Task Learning

DVF: Advancing Robust and Accurate Fine-Grained Image Retrieval with Retrieval Guidelines

no code implementations24 Apr 2024 Xin Jiang, Hao Tang, Rui Yan, Jinhui Tang, Zechao Li

This paper presents a meticulous analysis leading to the proposal of practical guidelines to identify subcategory-specific discrepancies and generate discriminative features to design effective FGIR models.

Image Retrieval Retrieval

AccidentBlip: Agent of Accident Warning based on MA-former

no code implementations18 Apr 2024 Yihua Shao, Yeling Xu, Xinwei Long, Siyu Chen, Ziyang Yan, Yang Yang, Haoting Liu, Yan Wang, Hao Tang, Zhen Lei

In particular, AccidentBlip achieves SOTA performance in both accident detection and prediction tasks on the DeepAccident dataset.

Language Modelling Large Language Model +2

StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion

1 code implementation9 Apr 2024 Ming Tao, Bing-Kun Bao, Hao Tang, YaoWei Wang, Changsheng Xu

3) The story visualization and continuation models are trained and inferred independently, which is not user-friendly.

Image Generation Story Visualization

HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud

1 code implementation CVPR 2024 Wencan Cheng, Hao Tang, Luc van Gool, Jong Hwan Ko

Extracting keypoint locations from input hand frames, known as 3D hand pose estimation, is a critical task in various human-computer interaction applications.

3D Hand Pose Estimation

Towards Robust 3D Pose Transfer with Adversarial Learning

no code implementations CVPR 2024 Haoyu Chen, Hao Tang, Ehsan Adeli, Guoying Zhao

This work is driven by the intuition that the robustness of the model can be enhanced by introducing adversarial samples into the training, leading to a model that is more robust to noisy inputs and that can even be extended to directly handle real-world data such as raw point clouds/scans without intermediate processing.

3D Generation Pose Transfer

Versatile Navigation under Partial Observability via Value-guided Diffusion Policy

no code implementations CVPR 2024 Gengyu Zhang, Hao Tang, Yan Yan

To address these deficiencies, we propose a versatile diffusion-based approach for both 2D and 3D route planning under partial observability.

Autonomous Driving Semantic Segmentation

On the Faithfulness of Vision Transformer Explanations

no code implementations CVPR 2024 Junyi Wu, Weitai Kang, Hao Tang, Yuan Hong, Yan Yan

In contrast, our proposed SaCo offers a reliable faithfulness measurement, establishing a robust metric for interpretations.

Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency

1 code implementation CVPR 2024 YingJie Xu, Bangzhen Liu, Hao Tang, Bailin Deng, Shengfeng He

We propose a voxel-based optimization framework, ReVoRF, for few-shot radiance fields that strategically addresses the unreliability in pseudo novel view synthesis.

Novel View Synthesis

Towards Online Real-Time Memory-based Video Inpainting Transformers

no code implementations24 Mar 2024 Guillaume Thiry, Hao Tang, Radu Timofte, Luc van Gool

Video inpainting tasks have seen significant improvements in recent years with the rise of deep neural networks and, in particular, vision transformers.

Video Inpainting

Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer

no code implementations CVPR 2024 Junyi Wu, Bin Duan, Weitai Kang, Hao Tang, Yan Yan

To incorporate the influence of token transformation into interpretation, we propose TokenTM, a novel post-hoc explanation method that utilizes our introduced measurement of token transformation effects.

MaskSAM: Towards Auto-prompt SAM with Mask Classification for Medical Image Segmentation

no code implementations21 Mar 2024 Bin Xie, Hao Tang, Bin Duan, Dawen Cai, Yan Yan

Each pair of auxiliary mask and box prompts, which can solve the requirements of extra prompts, is associated with class label predictions by the sum of the auxiliary classifier token and the learnable global classifier tokens in the mask decoder of SAM to solve the predictions of semantic labels.

Decoder Image Segmentation +3

StableGarment: Garment-Centric Generation via Stable Diffusion

no code implementations16 Mar 2024 Rui Wang, Hailong Guo, Jiaming Liu, Huaxia Li, Haibo Zhao, Xu Tang, Yao Hu, Hao Tang, Peipei Li

In this paper, we introduce StableGarment, a unified framework to tackle garment-centric (GC) generation tasks, including GC text-to-image, controllable GC text-to-image, stylized GC text-to-image, and robust virtual try-on.

Denoising Image Generation +1

Toward Adaptive Large Language Models Structured Pruning via Hybrid-grained Weight Importance Assessment

no code implementations16 Mar 2024 Jun Liu, Zhenglun Kong, Pu Zhao, Changdi Yang, Hao Tang, Xuan Shen, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Dong Huang, Yanzhi Wang

For example, HyWIA surpasses the cutting-edge LLM-Pruner by an average margin of 2.82% in accuracy across seven downstream tasks when pruning LLaMA-7B by 50%.

Decoder Language Modelling +1

GiT: Towards Generalist Vision Transformer through Universal Language Interface

1 code implementation14 Mar 2024 Haiyang Wang, Hao Tang, Li Jiang, Shaoshuai Shi, Muhammad Ferjad Naeem, Hongsheng Li, Bernt Schiele, LiWei Wang

Due to its simple design, this paradigm holds promise for narrowing the architectural gap between vision and language.

Ranked #2 on Video Captioning on MSVD-CTN (using extra training data)

Language Modeling Language Modelling +1

Motion Mamba: Efficient and Long Sequence Motion Generation

1 code implementation12 Mar 2024 Zeyu Zhang, Akide Liu, Ian Reid, Richard Hartley, Bohan Zhuang, Hao Tang

Human motion generation stands as a significant pursuit in generative computer vision, while achieving long-sequence and efficient motion generation remains challenging.

Mamba Motion Generation +2

InstructGIE: Towards Generalizable Image Editing

no code implementations8 Mar 2024 Zichong Meng, Changdi Yang, Jun Liu, Hao Tang, Pu Zhao, Yanzhi Wang

In response to this challenge, our study introduces a novel image editing framework with enhanced generalization robustness by boosting in-context learning capability and unifying language instruction.

Denoising In-Context Learning

Hierarchical Indexing for Retrieval-Augmented Opinion Summarization

1 code implementation1 Mar 2024 Tom Hosking, Hao Tang, Mirella Lapata

We show that HIRO learns an encoding space that is more semantically structured than prior work, and generates summaries that are more representative of the opinions in the input reviews.

Opinion Summarization Retrieval

WorldCoder, a Model-Based LLM Agent: Building World Models by Writing Code and Interacting with the Environment

no code implementations19 Feb 2024 Hao Tang, Darren Key, Kevin Ellis

We give a model-based agent that builds a Python program representing its knowledge of the world based on its interactions with the environment.

Program Synthesis Task Planning

SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation

1 code implementation CVPR 2024 Yuxuan Zhang, Yiren Song, Jiaming Liu, Rui Wang, Jinpeng Yu, Hao Tang, Huaxia Li, Xu Tang, Yao Hu, Han Pan, Zhongliang Jing

Recent advancements in subject-driven image generation have led to zero-shot generation, yet precise selection and focus on crucial subject representations remain challenging.

Image Generation

Enlighten-Your-Voice: When Multimodal Meets Zero-shot Low-light Image Enhancement

no code implementations15 Dec 2023 Xiaofeng Zhang, Zishan Xu, Hao Tang, Chaochen Gu, Wei Chen, Shanying Zhu, Xinping Guan

Low-light image enhancement is a crucial visual task, and many unsupervised methods tend to overlook the degradation of visible information in low-light scenes, which adversely affects the fusion of complementary information and hinders the generation of satisfactory results.

Low-Light Image Enhancement

Efficient-NeRF2NeRF: Streamlining Text-Driven 3D Editing with Multiview Correspondence-Enhanced Diffusion Models

no code implementations13 Dec 2023 Liangchen Song, Liangliang Cao, Jiatao Gu, Yifan Jiang, Junsong Yuan, Hao Tang

In this work, we propose that by incorporating correspondence regularization into diffusion models, the process of 3D editing can be significantly accelerated.

GPU

Learning Contrastive Self-Distillation for Ultra-Fine-Grained Visual Categorization Targeting Limited Samples

no code implementations10 Nov 2023 Ziye Fang, Xin Jiang, Hao Tang, Zechao Li

In the field of intelligent multimedia analysis, ultra-fine-grained visual categorization (Ultra-FGVC) plays a vital role in distinguishing intricate subcategories within broader categories.

Contrastive Learning Fine-Grained Visual Categorization

Multi-view Information Integration and Propagation for Occluded Person Re-identification

1 code implementation7 Nov 2023 Neng Dong, Shuanglin Yan, Hao Tang, Jinhui Tang, Liyan Zhang

Moreover, as multiple images with the same identity are not accessible in the testing stage, we devise an Information Propagation (IP) mechanism to distill knowledge from the comprehensive representation to that of a single occluded image.

Occluded Person Re-Identification

Towards High-quality HDR Deghosting with Conditional Diffusion Models

no code implementations2 Nov 2023 Qingsen Yan, Tao Hu, Yuan Sun, Hao Tang, Yu Zhu, Wei Dong, Luc van Gool, Yanning Zhang

To address this challenge, we formulate the HDR deghosting problem as an image generation task that leverages LDR features as the diffusion model's condition, consisting of the feature condition generator and the noise predictor.

Denoising Image Generation

Towards Matching Phones and Speech Representations

no code implementations26 Oct 2023 Gene-Ping Yang, Hao Tang

We study two key properties that enable matching, namely, whether cluster centroids of self-supervised representations reduce the variability of phone instances and respect the relationship among phones.

Self-Supervised Learning

Pedestrian Accessible Infrastructure Inventory: Assessing Zero-Shot Segmentation on Multi-Mode Geospatial Data for All Pedestrian Types

no code implementations15 Oct 2023 Jiahao Xia, Gavin Gong, Jiawei Liu, Zhigang Zhu, Hao Tang

In this paper, a Segment Anything Model (SAM)-based pedestrian infrastructure segmentation workflow is designed and optimized, which is capable of efficiently processing multi-sourced geospatial data including LiDAR data and satellite imagery data.

All Segmentation +1

Does Graph Distillation See Like Vision Dataset Counterpart?

2 code implementations NeurIPS 2023 Beining Yang, Kai Wang, Qingyun Sun, Cheng Ji, Xingcheng Fu, Hao Tang, Yang You, JianXin Li

We validate the proposed SGDD across 9 datasets and achieve state-of-the-art results on all of them: for example, on the YelpChi dataset, our approach maintains 98.6% test accuracy of training on the original graph dataset with 1,000 times saving on the scale of the graph.

Anomaly Detection Dataset Distillation +2

Efficient-3DiM: Learning a Generalizable Single-image Novel-view Synthesizer in One Day

no code implementations4 Oct 2023 Yifan Jiang, Hao Tang, Jen-Hao Rick Chang, Liangchen Song, Zhangyang Wang, Liangliang Cao

Although the fidelity and generalizability are greatly improved, training such a powerful diffusion model requires a vast volume of training data and model parameters, resulting in a notoriously long time and high computational costs.

Image Generation Novel View Synthesis

Distilling ODE Solvers of Diffusion Models into Smaller Steps

no code implementations CVPR 2024 Sanghwan Kim, Hao Tang, Fisher Yu

Notably, our method incurs negligible computational overhead compared to previous distillation techniques, facilitating straightforward and rapid integration with existing samplers.

Denoising Knowledge Distillation

Light Field Diffusion for Single-View Novel View Synthesis

no code implementations20 Sep 2023 Yifeng Xiong, Haoyu Ma, Shanlin Sun, Kun Han, Hao Tang, Xiaohui Xie

Starting from the camera pose matrices, LFD transforms them into light field encoding, with the same shape as the reference image, to describe the direction of each ray.

Denoising Novel View Synthesis +1

Delving into Multimodal Prompting for Fine-grained Visual Classification

no code implementations16 Sep 2023 Xin Jiang, Hao Tang, Junyao Gao, Xiaoyu Du, Shengfeng He, Zechao Li

In this paper, we aim to fully exploit the capabilities of cross-modal description to tackle FGVC tasks and propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the contrastive language-image pre-training (CLIP) model.

Classification Fine-Grained Image Classification

Temporal-aware Hierarchical Mask Classification for Video Semantic Segmentation

1 code implementation14 Sep 2023 Zhaochong An, Guolei Sun, Zongwei Wu, Hao Tang, Luc van Gool

Modern approaches have proved the huge potential of addressing semantic segmentation as a mask classification task which is widely used in instance-level segmentation.

Classification Decoder +3

Adapting Segment Anything Model for Change Detection in HR Remote Sensing Images

1 code implementation4 Sep 2023 Lei Ding, Kun Zhu, Daifeng Peng, Hao Tang, Kuiwu Yang, Lorenzo Bruzzone

In this work, we aim to utilize the strong visual recognition capabilities of VFMs to improve the change detection of high-resolution Remote Sensing Images (RSIs).

Change Detection Interactive Segmentation

M$^3$Net: Multi-view Encoding, Matching, and Fusion for Few-shot Fine-grained Action Recognition

no code implementations6 Aug 2023 Hao Tang, Jun Liu, Shuanglin Yan, Rui Yan, Zechao Li, Jinhui Tang

Due to the scarcity of manually annotated data required for fine-grained video understanding, few-shot fine-grained (FS-FG) action recognition has gained significant attention, with the aim of classifying novel fine-grained action categories with only a few labeled instances.

Decision Making Fine-grained Action Recognition +1

Interactive Neural Painting

no code implementations31 Jul 2023 Elia Peruzzo, Willi Menapace, Vidit Goel, Federica Arrigoni, Hao Tang, Xingqian Xu, Arman Chopikyan, Nikita Orlov, Yuxiao Hu, Humphrey Shi, Nicu Sebe, Elisa Ricci

This paper advances the state of the art in this emerging research domain by proposing the first approach for Interactive NP.

Decoder

Hybrid-CSR: Coupling Explicit and Implicit Shape Representation for Cortical Surface Reconstruction

no code implementations23 Jul 2023 Shanlin Sun, Thanh-Tung Le, Chenyu You, Hao Tang, Kun Han, Haoyu Ma, Deying Kong, Xiangyi Yan, Xiaohui Xie

We present Hybrid-CSR, a geometric deep-learning model that combines explicit and implicit shape representations for cortical surface reconstruction.

Surface Reconstruction

Edge Guided GANs with Multi-Scale Contrastive Learning for Semantic Image Synthesis

1 code implementation22 Jul 2023 Hao Tang, Guolei Sun, Nicu Sebe, Luc van Gool

To tackle 2), we design an effective module to selectively highlight class-dependent feature maps according to the original semantic layout to preserve the semantic information.

Contrastive Learning Image Generation

Erasing, Transforming, and Noising Defense Network for Occluded Person Re-Identification

1 code implementation14 Jul 2023 Neng Dong, Liyan Zhang, Shuanglin Yan, Hao Tang, Jinhui Tang

Occlusion perturbation presents a significant challenge in person re-identification (re-ID), and existing methods that rely on external visual cues require additional computational resources and only consider the issue of missing information caused by occlusion.

Adversarial Defense Occluded Person Re-Identification

Inter-Instance Similarity Modeling for Contrastive Learning

1 code implementation21 Jun 2023 Chengchao Shen, Dawei Liu, Hao Tang, Zhe Qu, Jianxin Wang

In this paper, we propose a novel image mix method, PatchMix, for contrastive learning in Vision Transformer (ViT), to model inter-instance similarities among images.

Contrastive Learning Instance Segmentation +4

Enlighten Anything: When Segment Anything Model Meets Low-Light Image Enhancement

2 code implementations17 Jun 2023 Qihan Zhao, Xiaofeng Zhang, Hao Tang, Chaochen Gu, Shanying Zhu

Image restoration is a low-level visual task, and most CNN methods are designed as black boxes, lacking transparency and intrinsic aesthetics.

Image Restoration Low-Light Image Enhancement +2

Edge-guided Representation Learning for Underwater Object Detection

no code implementations1 Jun 2023 Linhui Dai, Hong Liu, Pinhao Song, Hao Tang, Runwei Ding, Shengquan Li

The key to addressing these challenges is to focus the model on obtaining more discriminative information.

Object object-detection +2

Reversible Graph Neural Network-based Reaction Distribution Learning for Multiple Appropriate Facial Reactions Generation

1 code implementation24 May 2023 Tong Xu, Micol Spitale, Hao Tang, Lu Liu, Hatice Gunes, Siyang Song

This means that we approach this problem by considering the generation of a distribution of the listener's appropriate facial reactions instead of multiple different appropriate facial reactions, i.e., 'many' appropriate facial reaction labels are summarised as 'one' distribution label during training.

Graph Neural Network

Self-supervised Predictive Coding Models Encode Speaker and Phonetic Information in Orthogonal Subspaces

no code implementations21 May 2023 Oli Liu, Hao Tang, Sharon Goldwater

Self-supervised speech representations are known to encode both speaker and phonetic information, but how they are distributed in the high-dimensional space remains largely unexplored.

Disentanglement

Attributable and Scalable Opinion Summarization

1 code implementation19 May 2023 Tom Hosking, Hao Tang, Mirella Lapata

We propose a method for unsupervised opinion summarization that encodes sentences from customer reviews into a hierarchical discrete latent space, then identifies common opinions based on the frequency of their encodings.

Opinion Summarization Unsupervised Opinion Summarization

Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer

no code implementations CVPR 2023 Hao Tang, Songhua Liu, Tianwei Lin, Shaoli Huang, Fu Li, Dongliang He, Xinchao Wang

On the other hand, different from the vanilla version, we adopt a learnable scaling operation on content features before content-style feature interaction, which better preserves the original similarity between a pair of content features while ensuring the stylization quality.

Meta-Learning Style Transfer

Localized Region Contrast for Enhancing Self-Supervised Learning in Medical Image Segmentation

no code implementations6 Apr 2023 Xiangyi Yan, Junayed Naushad, Chenyu You, Hao Tang, Shanlin Sun, Kun Han, Haoyu Ma, James Duncan, Xiaohui Xie

In this paper, we propose a novel contrastive learning framework that integrates Localized Region Contrast (LRC) to enhance existing self-supervised pre-training methods for medical image segmentation.

Contrastive Learning Image Segmentation +5

Graph Transformer GANs for Graph-Constrained House Generation

no code implementations CVPR 2023 Hao Tang, Zhenyu Zhang, Humphrey Shi, Bo Li, Ling Shao, Nicu Sebe, Radu Timofte, Luc van Gool

We present a novel graph Transformer generative adversarial network (GTGAN) to learn effective graph node relations in an end-to-end fashion for the challenging graph-constrained house generation task.

Generative Adversarial Network House Generation +1

Analysis and Evaluation of Explainable Artificial Intelligence on Suicide Risk Assessment

no code implementations9 Mar 2023 Hao Tang, Aref Miri Rekavandi, Dharjinder Rooprai, Girish Dwivedi, Frank Sanfilippo, Farid Boussaid, Mohammed Bennamoun

This study investigates the effectiveness of Explainable Artificial Intelligence (XAI) techniques in predicting suicide risks and identifying the dominant causes for such behaviours.

Data Augmentation Decision Making +2

DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network

1 code implementation CVPR 2023 Xuan Shen, Yaohua Wang, Ming Lin, Yilun Huang, Hao Tang, Xiuyu Sun, Yanzhi Wang

To this end, a novel framework termed Mathematical Architecture Design for Deep CNN (DeepMAD) is proposed to design high-performance CNN models in a principled way.

GPU Image Classification +1

Bipartite Graph Diffusion Model for Human Interaction Generation

1 code implementation24 Jan 2023 Baptiste Chopin, Hao Tang, Mohamed Daoudi

The generation of natural human motion interactions is a hot topic in computer vision and computer animation.

Diversity model

Learning Concordant Attention via Target-aware Alignment for Visible-Infrared Person Re-identification

no code implementations ICCV 2023 Jianbing Wu, Hong Liu, Yuxin Su, Wei Shi, Hao Tang

Owing to the large distribution gap between the heterogeneous data in Visible-Infrared Person Re-identification (VI Re-ID), we point out that existing paradigms often suffer from the inter-modal semantic misalignment issue and thus fail to align and compare local details properly.

Cross-Modal Retrieval Person Re-Identification +1

Pruning Parameterization With Bi-Level Optimization for Efficient Semantic Segmentation on the Edge

no code implementations CVPR 2023 Changdi Yang, Pu Zhao, Yanyu Li, Wei Niu, Jiexiong Guan, Hao Tang, Minghai Qin, Bin Ren, Xue Lin, Yanzhi Wang

With the ever-increasing popularity of edge devices, it is necessary to implement real-time segmentation on the edge for autonomous driving and many other applications.

Autonomous Driving Segmentation +1

Few-shot Medical Image Segmentation with Cycle-resemblance Attention

no code implementations7 Dec 2022 Hao Ding, Changchang Sun, Hao Tang, Dawen Cai, Yan Yan

Recently, due to the increasing requirements of medical imaging applications and the professional requirements of annotating medical images, few-shot learning has gained increasing attention in the medical image semantic segmentation field.

Few-Shot Learning Image Segmentation +4

Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training

1 code implementation19 Nov 2022 Zhenglun Kong, Haoyu Ma, Geng Yuan, Mengshu Sun, Yanyue Xie, Peiyan Dong, Xin Meng, Xuan Shen, Hao Tang, Minghai Qin, Tianlong Chen, Xiaolong Ma, Xiaohui Xie, Zhangyang Wang, Yanzhi Wang

Vision transformers (ViTs) have recently obtained success in many applications, but their intensive computation and heavy memory usage at both training and inference time limit their generalization.

Is Smaller Always Faster? Tradeoffs in Compressing Self-Supervised Speech Transformers

1 code implementation17 Nov 2022 Tzu-Quan Lin, Tsung-Huan Yang, Chun-Yao Chang, Kuang-Ming Chen, Tzu-hsun Feng, Hung-Yi Lee, Hao Tang

Transformer-based self-supervised models have achieved remarkable success in speech processing, but their large size and high inference cost present significant challenges for real-world deployment.

Knowledge Distillation Model Compression +1

MelHuBERT: A simplified HuBERT on Mel spectrograms

1 code implementation17 Nov 2022 Tzu-Quan Lin, Hung-Yi Lee, Hao Tang

Self-supervised models have had great success in learning speech representations that can generalize to various downstream tasks.

Automatic Speech Recognition Self-Supervised Learning +3

Deep Unsupervised Key Frame Extraction for Efficient Video Classification

no code implementations12 Nov 2022 Hao Tang, Lei Ding, Songsong Wu, Bin Ren, Nicu Sebe, Paolo Rota

The proposed TSDPC is a generic and powerful framework with two advantages over previous works; one is that it can calculate the number of key frames automatically.

Classification Video Classification

Bipartite Graph Reasoning GANs for Person Pose and Facial Image Synthesis

1 code implementation12 Nov 2022 Hao Tang, Ling Shao, Philip H. S. Torr, Nicu Sebe

To further capture the change in pose of each part more precisely, we propose a novel part-aware bipartite graph reasoning (PBGR) block to decompose the task of reasoning the global structure transformation with a bipartite graph into learning different local transformations for different semantic body/face parts.

Generative Adversarial Network Image Generation

Data Level Lottery Ticket Hypothesis for Vision Transformers

1 code implementation2 Nov 2022 Xuan Shen, Zhenglun Kong, Minghai Qin, Peiyan Dong, Geng Yuan, Xin Meng, Hao Tang, Xiaolong Ma, Yanzhi Wang

That is, there exists a subset of input image patches such that a ViT can be trained from scratch by using only this subset of patches and achieve similar accuracy to the ViTs trained by using all image patches.

Analogical Similarity Informativeness

Learning Dependencies of Discrete Speech Representations with Neural Hidden Markov Models

no code implementations29 Oct 2022 Sung-Lin Yeh, Hao Tang

While discrete latent variable models have had great success in self-supervised learning, most models assume that frames are independent.

Self-Supervised Learning

Analyzing Acoustic Word Embeddings from Pre-trained Self-supervised Speech Models

no code implementations28 Oct 2022 Ramon Sanabria, Hao Tang, Sharon Goldwater

Given the strong results of self-supervised models on various tasks, there have been surprisingly few studies exploring self-supervised representations for acoustic word embeddings (AWE), fixed-dimensional vectors representing variable-length spoken word segments.

Word Embeddings

Conditioning and Sampling in Variational Diffusion Models for Speech Super-Resolution

1 code implementation27 Oct 2022 Chin-Yun Yu, Sung-Lin Yeh, György Fazekas, Hao Tang

Moreover, by coupling the proposed sampling method with an unconditional DM, i.e., a DM with no auxiliary inputs to its noise predictor, we can generalize it to a wide range of SR setups.

Super-Resolution

ADPS: Asymmetric Distillation Post-Segmentation for Image Anomaly Detection

no code implementations19 Oct 2022 Peng Xing, Hao Tang, Jinhui Tang, Zechao Li

However, existing KDAD methods suffer from two main limitations: 1) the student network can effortlessly replicate the teacher network's representations, and 2) the features of the teacher network serve solely as a "reference standard" and are not fully leveraged.

Anomaly Detection Anomaly Localization +1

On Compressing Sequences for Self-Supervised Speech Models

no code implementations13 Oct 2022 Yen Meng, Hsuan-Jui Chen, Jiatong Shi, Shinji Watanabe, Paola Garcia, Hung-Yi Lee, Hao Tang

Subsampling while training self-supervised models not only improves the overall performance on downstream tasks under certain frame rates, but also brings significant speed-up in inference.

Self-Supervised Learning

SiNeRF: Sinusoidal Neural Radiance Fields for Joint Pose Estimation and Scene Reconstruction

1 code implementation10 Oct 2022 Yitong Xia, Hao Tang, Radu Timofte, Luc van Gool

NeRFmm is the Neural Radiance Fields (NeRF) method that deals with joint optimization tasks, i.e., reconstructing real-world scenes and registering camera parameters simultaneously.

Image Generation NeRF +1

Boosting Few-shot Fine-grained Recognition with Background Suppression and Foreground Alignment

1 code implementation4 Oct 2022 Zican Zha, Hao Tang, Yunlian Sun, Jinhui Tang

To address this challenging task, we propose a two-stage background suppression and foreground alignment framework, which is composed of a background activation suppression (BAS) module, a foreground object alignment (FOA) module, and a local-to-local (L2L) similarity metric.

Few-Shot Learning

Physical Adversarial Attack meets Computer Vision: A Decade Survey

1 code implementation30 Sep 2022 Hui Wei, Hao Tang, Xuemei Jia, Zhixiang Wang, Hanxun Yu, Zhubo Li, Shin'ichi Satoh, Luc van Gool, Zheng Wang

Building upon this foundation, we uncover the pervasive role of artifacts carrying adversarial perturbations in the physical world.

Adversarial Attack Medical Diagnosis +1

PPT: token-Pruned Pose Transformer for monocular and multi-view human pose estimation

2 code implementations16 Sep 2022 Haoyu Ma, Zhe Wang, Yifei Chen, Deying Kong, Liangjian Chen, Xingwei Liu, Xiangyi Yan, Hao Tang, Xiaohui Xie

In this paper, we propose the token-Pruned Pose Transformer (PPT) for 2D human pose estimation, which can locate a rough human mask and perform self-attention only within selected tokens.

Ranked #20 on 3D Human Pose Estimation on Human3.6M (using extra training data)

2D Human Pose Estimation 3D Human Pose Estimation

Facial Expression Translation using Landmark Guided GANs

1 code implementation5 Sep 2022 Hao Tang, Nicu Sebe

We propose a simple yet powerful Landmark guided Generative Adversarial Network (LandmarkGAN) for the facial expression-to-expression translation using a single image, which is an important and challenging task in computer vision since the expression-to-expression translation is a non-linear and non-aligned problem.

Facial Expression Translation Generative Adversarial Network +1

Image-Specific Information Suppression and Implicit Local Alignment for Text-based Person Search

no code implementations30 Aug 2022 Shuanglin Yan, Hao Tang, Liyan Zhang, Jinhui Tang

Moreover, existing methods seldom consider the information inequality problem between modalities caused by image-specific information.

Person Search Text based Person Search

Training and Tuning Generative Neural Radiance Fields for Attribute-Conditional 3D-Aware Face Generation

1 code implementation26 Aug 2022 Jichao Zhang, Aliaksandr Siarohin, Yahui Liu, Hao Tang, Nicu Sebe, Wei Wang

Generative Neural Radiance Fields (GNeRF)-based 3D-aware GANs have showcased remarkable prowess in crafting high-fidelity images while upholding robust 3D consistency, particularly face generation.

Attribute Disentanglement +2

Identity-Sensitive Knowledge Propagation for Cloth-Changing Person Re-identification

1 code implementation25 Aug 2022 Jianbing Wu, Hong Liu, Wei Shi, Hao Tang, Jingwen Guo

To mitigate the resolution degradation issue and mine identity-sensitive cues from human faces, we propose to restore the missing facial details using prior facial knowledge, which is then propagated to a smaller network.

Cloth-Changing Person Re-Identification Human Parsing

G2P-DDM: Generating Sign Pose Sequence from Gloss Sequence with Discrete Diffusion Model

no code implementations19 Aug 2022 Pan Xie, Qipeng Zhang, Taiyi Peng, Hao Tang, Yao Du, Zexian Li

Our approach focuses on the transformation of sign gloss sequences into their corresponding sign pose sequences (G2P).

Denoising Quantization +1

Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization

no code implementations10 Aug 2022 Zhengang Li, Mengshu Sun, Alec Lu, Haoyu Ma, Geng Yuan, Yanyue Xie, Hao Tang, Yanyu Li, Miriam Leeser, Zhangyang Wang, Xue Lin, Zhenman Fang

Compared with state-of-the-art ViT quantization work (algorithmic approach only without hardware acceleration), our quantization achieves 0.47% to 1.36% higher Top-1 accuracy under the same bit-width.

Quantization

Compiler-Aware Neural Architecture Search for On-Mobile Real-time Super-Resolution

1 code implementation25 Jul 2022 Yushu Wu, Yifan Gong, Pu Zhao, Yanyu Li, Zheng Zhan, Wei Niu, Hao Tang, Minghai Qin, Bin Ren, Yanzhi Wang

Instead of measuring the speed on mobile devices at each iteration during the search process, a speed model incorporated with compiler optimizations is leveraged to predict the inference latency of the SR block with various width configurations for faster convergence.

GPU Neural Architecture Search +2

Towards Interpretable Video Super-Resolution via Alternating Optimization

1 code implementation21 Jul 2022 JieZhang Cao, Jingyun Liang, Kai Zhang, Wenguan Wang, Qin Wang, Yulun Zhang, Hao Tang, Luc van Gool

These issues can be alleviated by a cascade of three separate sub-tasks, including video deblurring, frame interpolation, and super-resolution, which, however, would fail to capture the spatial and temporal correlations among video sequences.

Deblurring Space-time Video Super-resolution +3

MLP-GAN for Brain Vessel Image Segmentation

no code implementations17 Jul 2022 Bin Xie, Hao Tang, Bin Duan, Dawen Cai, Yan Yan

Brain vessel image segmentation can be used as a promising biomarker for better prevention and treatment of different diseases.

Generative Adversarial Network Image Segmentation +2

RCRN: Real-world Character Image Restoration Network via Skeleton Extraction

1 code implementation16 Jul 2022 Daqian Shi, Xiaolei Diao, Hao Tang, Xiaomin Li, Hao Xing, Hao Xu

SENet aims to preserve the structural consistency of the character and normalize complex noise.

Image Restoration

CharFormer: A Glyph Fusion based Attentive Framework for High-precision Character Image Denoising

1 code implementation16 Jul 2022 Daqian Shi, Xiaolei Diao, Lida Shi, Hao Tang, Yang Chi, Chuntao Li, Hao Xu

Degraded images commonly exist in the general sources of character images, leading to unsatisfactory character recognition results.

Image Denoising

RZCR: Zero-shot Character Recognition via Radical-based Reasoning

no code implementations12 Jul 2022 Xiaolei Diao, Daqian Shi, Hao Tang, Qiang Shen, Yanzeng Li, Lei Wu, Hao Xu

The long-tail effect is a common issue that limits the performance of deep learning models on real-world datasets.

PI-Trans: Parallel-ConvMLP and Implicit-Transformation Based GAN for Cross-View Image Translation

1 code implementation9 Jul 2022 Bin Ren, Hao Tang, Yiming Wang, Xia Li, Wei Wang, Nicu Sebe

For semantic-guided cross-view image translation, it is crucial to learn where to sample pixels from the source view image and where to reallocate them guided by the target view semantic map, especially when there is little overlap or drastic view difference between the source and target images.

Generative Adversarial Network

Contrastive Learning from Spatio-Temporal Mixed Skeleton Sequences for Self-Supervised Skeleton-Based Action Recognition

1 code implementation7 Jul 2022 Zhan Chen, Hong Liu, Tianyu Guo, Zhengyan Chen, Pinhao Song, Hao Tang

First, SkeleMix utilizes the topological information of skeleton data to mix two skeleton sequences by randomly combining the cropped skeleton fragments (the trimmed view) with the remaining skeleton sequences (the truncated view).

Action Recognition Contrastive Learning +3

Interaction Transformer for Human Reaction Generation

1 code implementation4 Jul 2022 Baptiste Chopin, Hao Tang, Naima Otberdout, Mohamed Daoudi, Nicu Sebe

To address this limitation, we propose a novel interaction Transformer (InterFormer) consisting of a Transformer network with both temporal and spatial attention.

Unsupervised High-Resolution Portrait Gaze Correction and Animation

1 code implementation1 Jul 2022 Jichao Zhang, Jingjing Chen, Hao Tang, Enver Sangineto, Peng Wu, Yan Yan, Nicu Sebe, Wei Wang

Solving this problem using an unsupervised method remains an open problem, especially for high-resolution face images in the wild, which are not easy to annotate with gaze and head pose labels.

Image Inpainting Vocal Bursts Intensity Prediction

3D-Aware Video Generation

1 code implementation29 Jun 2022 Sherwin Bahmani, Jeong Joon Park, Despoina Paschalidou, Hao Tang, Gordon Wetzstein, Leonidas Guibas, Luc van Gool, Radu Timofte

Generative models have emerged as an essential building block for many image synthesis and editing tasks.

Image Generation Video Generation

GraphMLP: A Graph MLP-Like Architecture for 3D Human Pose Estimation

1 code implementation13 Jun 2022 Wenhao Li, Mengyuan Liu, Hong Liu, Tianyu Guo, Ti Wang, Hao Tang, Nicu Sebe

To the best of our knowledge, this is the first MLP-Like architecture for 3D human pose estimation in a single frame and a video sequence.

3D Human Pose Estimation Representation Learning

From Perception to Programs: Regularize, Overparameterize, and Amortize

no code implementations13 Jun 2022 Hao Tang, Kevin Ellis

Toward combining inductive reasoning with perception abilities, we develop techniques for neurosymbolic program synthesis where perceptual input is first parsed by neural nets into a low-dimensional interpretable representation, which is then processed by a synthesized program.

Program Synthesis

Medical Image Registration via Neural Fields

no code implementations7 Jun 2022 Shanlin Sun, Kun Han, Chenyu You, Hao Tang, Deying Kong, Junayed Naushad, Xiangyi Yan, Haoyu Ma, Pooya Khosravi, James S. Duncan, Xiaohui Xie

Traditional methods for image registration are primarily optimization-driven, finding the optimal deformations that maximize the similarity between two images.

Image Registration Medical Image Analysis +2

Real-Time Portrait Stylization on the Edge

no code implementations2 Jun 2022 Yanyu Li, Xuan Shen, Geng Yuan, Jiexiong Guan, Wei Niu, Hao Tang, Bin Ren, Yanzhi Wang

In this work, we demonstrate real-time portrait stylization, specifically, translating self-portraits into cartoon or anime style on mobile devices.

DE-Net: Dynamic Text-guided Image Editing Adversarial Networks

1 code implementation2 Jun 2022 Ming Tao, Bing-Kun Bao, Hao Tang, Fei Wu, Longhui Wei, Qi Tian

To solve these limitations, we propose: (i) a Dynamic Editing Block (DEBlock) which composes different editing modules dynamically for various editing requirements.

text-guided-image-editing

AO2-DETR: Arbitrary-Oriented Object Detection Transformer

1 code implementation25 May 2022 Linhui Dai, Hong Liu, Hao Tang, Zhiwei Wu, Pinhao Song

Comprehensive experiments on several challenging datasets show that our method achieves superior performance on the AOOD task.

Decoder Inductive Bias +5

Supervised Attention in Sequence-to-Sequence Models for Speech Recognition

no code implementations25 Apr 2022 Gene-Ping Yang, Hao Tang

The attention mechanism in sequence-to-sequence models is designed to model the alignments between acoustic features and output tokens in speech recognition.

speech-recognition Speech Recognition

Autoregressive Co-Training for Learning Discrete Speech Representations

1 code implementation29 Mar 2022 Sung-Lin Yeh, Hao Tang

While several self-supervised approaches for learning discrete speech representation have been proposed, it is unclear how these seemingly similar approaches relate to each other.

Quantization
