Search Results for author: Hao Tan

Found 66 papers, 30 papers with code

Test-Time Training Done Right

no code implementations29 May 2025 Tianyuan Zhang, Sai Bi, Yicong Hong, Kai Zhang, Fujun Luan, Songlin Yang, Kalyan Sunkavalli, William T. Freeman, Hao Tan

We validate our approach across diverse modalities and tasks, including novel view synthesis from image sets, language models, and auto-regressive video diffusion.

Novel View Synthesis

Neural BRDF Importance Sampling by Reparameterization

no code implementations13 May 2025 Liwen Wu, Sai Bi, Zexiang Xu, Hao Tan, Kai Zhang, Fujun Luan, Haolin Lu, Ravi Ramamoorthi

Neural bidirectional reflectance distribution functions (BRDFs) have emerged as popular material representations for enhancing realism in physically-based rendering.

Gaussian Mixture Flow Matching Models

1 code implementation7 Apr 2025 Hansheng Chen, Kai Zhang, Hao Tan, Zexiang Xu, Fujun Luan, Leonidas Guibas, Gordon Wetzstein, Sai Bi

Diffusion models approximate the denoising distribution as a Gaussian and predict its mean, whereas flow matching models reparameterize the Gaussian mean as flow velocity.

Denoising Image Generation
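
As a concrete illustration of the mean/velocity relationship described in the abstract above, the sketch below converts between the two parameterizations under the standard rectified-flow interpolation. This is the generic identity rather than the paper's Gaussian-mixture formulation, and all function names are illustrative.

```python
import torch

def velocity_to_denoised_mean(x_t: torch.Tensor, v: torch.Tensor, t: float) -> torch.Tensor:
    """Recover the predicted clean sample from a flow-matching velocity.

    Assumes the rectified-flow interpolation x_t = (1 - t) * x0 + t * eps,
    whose target velocity is v = eps - x0, hence x0 = x_t - t * v.
    """
    return x_t - t * v

def denoised_mean_to_velocity(x_t: torch.Tensor, x0_hat: torch.Tensor, t: float) -> torch.Tensor:
    """Inverse map: express a diffusion-style mean prediction as a velocity."""
    return (x_t - x0_hat) / t

# Round-trip check on random tensors.
x0, eps, t = torch.randn(4, 8), torch.randn(4, 8), 0.3
x_t = (1 - t) * x0 + t * eps
v = eps - x0
assert torch.allclose(velocity_to_denoised_mean(x_t, v, t), x0, atol=1e-5)
```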

Recover and Match: Open-Vocabulary Multi-Label Recognition through Knowledge-Constrained Optimal Transport

1 code implementation CVPR 2025 Hao Tan, Zichang Tan, Jun Li, Ajian Liu, Jun Wan, Zhen Lei

Identifying multiple novel classes in an image, known as open-vocabulary multi-label recognition, is a challenging task in computer vision.

VEGGIE: Instructional Editing and Reasoning Video Concepts with Grounded Generation

no code implementations18 Mar 2025 Shoubin Yu, Difan Liu, Ziqiao Ma, Yicong Hong, Yang Zhou, Hao Tan, Joyce Chai, Mohit Bansal

To support diverse tasks and complex instructions, we employ a curriculum learning strategy: first aligning the MLLM and video diffusion model with large-scale instructional image editing data, followed by end-to-end fine-tuning on high-quality multitask video data.

Reasoning Segmentation Video Editing

RigAnything: Template-Free Autoregressive Rigging for Diverse 3D Assets

no code implementations13 Feb 2025 Isabella Liu, Zhan Xu, Wang Yifan, Hao Tan, Zexiang Xu, Xiaolong Wang, Hao Su, Zifan Shi

To achieve this, we organize the joints in breadth-first search (BFS) order, enabling the skeleton to be defined as a sequence of 3D locations and parent indices.
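
A minimal sketch of that serialization, assuming the rig is stored as a plain dictionary tree (names hypothetical). BFS ordering guarantees that every joint's parent already has a sequence index when the joint is emitted, which is what makes autoregressive prediction over the sequence well-defined.

```python
from collections import deque

def skeleton_to_bfs_sequence(joints, children, root=0):
    """Flatten a rig tree into a BFS-ordered token sequence.

    joints:   {joint_id: (x, y, z)}       3D joint locations
    children: {joint_id: [child_id, ...]} tree adjacency
    Returns [(position, parent_seq_index)]; the root's parent index is -1,
    and BFS order guarantees every parent appears before its children.
    """
    seq, seq_index = [], {}
    queue = deque([(root, -1)])  # (joint, parent's sequence index)
    while queue:
        j, parent_idx = queue.popleft()
        seq_index[j] = len(seq)
        seq.append((joints[j], parent_idx))
        for c in children.get(j, []):
            queue.append((c, seq_index[j]))
    return seq

# Tiny example: root with two children, one grandchild.
joints = {0: (0, 0, 0), 1: (0, 1, 0), 2: (1, 0, 0), 3: (0, 2, 0)}
children = {0: [1, 2], 1: [3]}
print(skeleton_to_bfs_sequence(joints, children))
# [((0, 0, 0), -1), ((0, 1, 0), 0), ((1, 0, 0), 0), ((0, 2, 0), 1)]
```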

Adaptive Few-shot Prompting for Machine Translation with Pre-trained Language Models

no code implementations3 Jan 2025 Lei Tang, Jinghui Qin, Wenxuan Ye, Hao Tan, Zhijing Yang

Recently, large language models (LLMs) with in-context learning have demonstrated remarkable potential for neural machine translation.

In-Context Learning Machine Translation +2

Large-scale Multi-view Tensor Clustering with Implicit Linear Kernels

no code implementations CVPR 2025 Jiyuan Liu, Xinwang Liu, Chuankun Li, Xinhang Wan, Hao Tan, Yi Zhang, Weixuan Liang, Qian Qu, Yu Feng, Renxiang Guan, Ke Liang

On its basis, a novel large-scale multi-view tensor clustering method is developed by incorporating the pair-wise similarities with implicit linear kernel function.

Clustering

LazyDiT: Lazy Learning for the Acceleration of Diffusion Transformers

no code implementations17 Dec 2024 Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Yanyu Li, Yifan Gong, Kai Zhang, Hao Tan, Jason Kuen, Henghui Ding, Zhihao Shu, Wei Niu, Pu Zhao, Yanzhi Wang, Jiuxiang Gu

In this paper, we show that performing the full computation of the model at each diffusion step is unnecessary, as some computations can be skipped by lazily reusing the results of previous steps.

Denoising
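
The sketch below illustrates the general idea of lazily reusing a block's output across diffusion steps. The skip rule here is a simple cosine-similarity heuristic for illustration; the paper instead learns when computation can be skipped.

```python
import torch

class LazyBlock(torch.nn.Module):
    """Wraps a transformer block and lazily reuses its previous-step output.

    If the input has barely changed since the last diffusion step (cosine
    similarity above a threshold), the cached output is returned instead of
    recomputing the block. The threshold rule is an illustrative stand-in
    for the paper's learned skipping criterion.
    """
    def __init__(self, block: torch.nn.Module, threshold: float = 0.999):
        super().__init__()
        self.block, self.threshold = block, threshold
        self._last_in = self._last_out = None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self._last_in is not None and self._last_in.shape == x.shape:
            sim = torch.nn.functional.cosine_similarity(
                x.flatten(1), self._last_in.flatten(1), dim=1).mean()
            if sim > self.threshold:
                return self._last_out  # skip: reuse the cached result
        out = self.block(x)
        self._last_in, self._last_out = x.detach(), out.detach()
        return out
```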

Numerical Pruning for Efficient Autoregressive Models

no code implementations17 Dec 2024 Xuan Shen, Zhao Song, Yufa Zhou, Bo Chen, Jing Liu, Ruiyi Zhang, Ryan A. Rossi, Hao Tan, Tong Yu, Xiang Chen, Yufan Zhou, Tong Sun, Pu Zhao, Yanzhi Wang, Jiuxiang Gu

Transformers have emerged as the leading architecture in deep learning, proving to be versatile and highly effective across diverse domains beyond language and image processing.

Decoder Image Generation

Turbo3D: Ultra-fast Text-to-3D Generation

no code implementations CVPR 2025 Hanzhe Hu, Tianwei Yin, Fujun Luan, Yiwei Hu, Hao Tan, Zexiang Xu, Sai Bi, Shubham Tulsiani, Kai Zhang

By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency.

3D Generation Text to 3D

Buffer Anytime: Zero-Shot Video Depth and Normal from Image Priors

no code implementations CVPR 2025 Zhengfei Kuang, Tianyuan Zhang, Kai Zhang, Hao Tan, Sai Bi, Yiwei Hu, Zexiang Xu, Milos Hasan, Gordon Wetzstein, Fujun Luan

We present Buffer Anytime, a framework for estimating depth and normal maps (which we call geometric buffers) from video that eliminates the need for paired video-depth and video-normal training data.

Optical Flow Estimation

Generating 3D-Consistent Videos from Unposed Internet Photos

no code implementations CVPR 2025 Gene Chou, Kai Zhang, Sai Bi, Hao Tan, Zexiang Xu, Fujun Luan, Bharath Hariharan, Noah Snavely

We design a self-supervised method that takes advantage of the consistency of videos and variability of multiview internet photos to train a scalable, 3D-aware video model without any 3D annotations such as camera parameters.

LVSM: A Large View Synthesis Model with Minimal 3D Inductive Bias

no code implementations22 Oct 2024 Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, Zexiang Xu

We propose the Large View Synthesis Model (LVSM), a novel transformer-based approach for scalable and generalizable novel view synthesis from sparse-view inputs.

3DGS Decoder +5

Long-LRM: Long-sequence Large Reconstruction Model for Wide-coverage Gaussian Splats

no code implementations16 Oct 2024 Chen Ziwen, Hao Tan, Kai Zhang, Sai Bi, Fujun Luan, Yicong Hong, Li Fuxin, Zexiang Xu

Unlike previous feed-forward models that are limited to processing 1-4 input images and can only reconstruct a small portion of a large scene, Long-LRM reconstructs the entire scene in a single feed-forward step.

Progressive Autoregressive Video Diffusion Models

1 code implementation10 Oct 2024 Desai Xie, Zhan Xu, Yicong Hong, Hao Tan, Difan Liu, Feng Liu, Arie Kaufman, Yang Zhou

In this work, we introduce a more natural formulation of autoregressive long video generation by revisiting the noise level assumption in video diffusion models.

Denoising Video Denoising +1

RelitLRM: Generative Relightable Radiance for Large Reconstruction Models

no code implementations8 Oct 2024 Tianyuan Zhang, Zhengfei Kuang, Haian Jin, Zexiang Xu, Sai Bi, Hao Tan, He Zhang, Yiwei Hu, Milos Hasan, William T. Freeman, Kai Zhang, Fujun Luan

We propose RelitLRM, a Large Reconstruction Model (LRM) for generating high-quality Gaussian splatting representations of 3D objects under novel illuminations from sparse (4-8) posed images captured under unknown static lighting.

Inverse Rendering

Identifying Speakers in Dialogue Transcripts: A Text-based Approach Using Pretrained Language Models

1 code implementation16 Jul 2024 Minh Nguyen, Franck Dernoncourt, Seunghyun Yoon, Hanieh Deilamsalehy, Hao Tan, Ryan Rossi, Quan Hung Tran, Trung Bui, Thien Huu Nguyen

We introduce an approach to identifying speaker names in dialogue transcripts, a crucial task for enhancing content accessibility and searchability in digital media archives.

Attribute Speaker Identification +2

LRM-Zero: Training Large Reconstruction Models with Synthesized Data

1 code implementation13 Jun 2024 Desai Xie, Sai Bi, Zhixin Shu, Kai Zhang, Zexiang Xu, Yi Zhou, Sören Pirk, Arie Kaufman, Xin Sun, Hao Tan

We demonstrate that our LRM-Zero, trained with our fully synthesized Zeroverse, can achieve high visual quality in the reconstruction of real-world objects, competitive with models trained on Objaverse.

3D Reconstruction

GS-LRM: Large Reconstruction Model for 3D Gaussian Splatting

no code implementations30 Apr 2024 Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, Zexiang Xu

We propose GS-LRM, a scalable large reconstruction model that can predict high-quality 3D Gaussian primitives from 2-4 posed sparse images in 0.23 seconds on a single A100 GPU.

3D Generation

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

1 code implementation19 Apr 2024 Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time?

SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

no code implementations18 Apr 2024 Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, Handong Zhao, Ani Nenkova, Liang-Yan Gui, Tong Sun, Yu-Xiong Wang

Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks.

Segmentation

PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition

no code implementations31 Jan 2024 Hao Tan, Zichang Tan, Jun Li, Jun Wan, Zhen Lei

In contrast to the unidirectional fusion in previous works, we introduce a Dual-Modal Attention (DMA) that enables bidirectional interaction between textual and visual features, yielding context-aware label representations and semantics-related visual representations, which are subsequently used to calculate similarities and generate final predictions for all labels.

Multi-Label Image Recognition Representation Learning
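
A minimal sketch of such bidirectional cross-attention, assuming standard multi-head attention modules; the dimensions, pooling, and scoring below are illustrative rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class DualModalAttention(nn.Module):
    """Bidirectional cross-attention between label-text and visual features.

    Text attends to vision (context-aware label representations) and vision
    attends to text (semantics-related visual representations).
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.text_to_vis = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, vis: torch.Tensor):
        # text: (B, num_labels, D), vis: (B, num_patches, D)
        text_ctx, _ = self.text_to_vis(text, vis, vis)  # labels gather visual context
        vis_ctx, _ = self.vis_to_text(vis, text, text)  # patches gather label semantics
        return text_ctx, vis_ctx

# Per-label scores from the fused representations (illustrative pooling).
B, L, P, D = 2, 20, 49, 256
dma = DualModalAttention(D)
text_ctx, vis_ctx = dma(torch.randn(B, L, D), torch.randn(B, P, D))
logits = torch.einsum('bld,bpd->blp', text_ctx, vis_ctx).max(dim=-1).values  # (B, L)
```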

Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM

no code implementations22 Jan 2024 Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang

We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image.

Decoder NeRF

Building Vision-Language Models on Solid Foundations with Masked Distillation

no code implementations CVPR 2024 Sepehr Sameni, Kushal Kafle, Hao Tan, Simon Jenni

Recent advancements in Vision-Language Models (VLMs) have marked a significant leap in bridging the gap between computer vision and natural language processing.

Contrastive Learning Knowledge Distillation +4

Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

no code implementations CVPR 2024 Desai Xie, Jiahao Li, Hao Tan, Xin Sun, Zhixin Shu, Yi Zhou, Sai Bi, Sören Pirk, Arie E. Kaufman

To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models.

Language Modelling Large Language Model +2

Compound Text-Guided Prompt Tuning via Image-Adaptive Cues

1 code implementation11 Dec 2023 Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, Xiangyu Zhang

We introduce text supervision to the optimization of prompts, which enables two benefits: 1) releasing the model reliance on the pre-defined category names during inference, thereby enabling more flexible prompt generation; 2) reducing the number of inputs to the text encoder, which decreases GPU memory consumption significantly.

Domain Generalization

PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

no code implementations20 Nov 2023 Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, Kai Zhang

We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images even with little visual overlap, while simultaneously estimating the relative camera poses in ~1.3 seconds on a single A100 GPU.

3D Reconstruction Image to 3D +1

DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model

no code implementations15 Nov 2023 Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, Kai Zhang

We propose DMV3D, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion.

3D Generation Denoising +3

Federated Skewed Label Learning with Logits Fusion

no code implementations14 Nov 2023 Yuwei Wang, Runhan Li, Hao Tan, Xuefeng Jiang, Sheng Sun, Min Liu, Bo Gao, Zhiyuan Wu

By fusing the logits of the two models, the private weak learner can capture the variance of different data, regardless of their category.

Federated Learning
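
A minimal sketch of the logit-fusion step, assuming a simple convex combination; the paper's actual weighting rule may differ, so alpha is an illustrative hyperparameter.

```python
import torch

def fuse_logits(logits_private: torch.Tensor,
                logits_global: torch.Tensor,
                alpha: float = 0.5) -> torch.Tensor:
    """Fuse the private (weak) learner's logits with the shared model's logits.

    A convex combination is the simplest fusion rule; alpha is illustrative.
    """
    return alpha * logits_private + (1 - alpha) * logits_global

# The fused logits can then supervise the private learner as soft labels.
fused = fuse_logits(torch.randn(8, 10), torch.randn(8, 10))
soft_targets = torch.softmax(fused, dim=-1)
```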

LRM: Large Reconstruction Model for Single Image to 3D

1 code implementation8 Nov 2023 Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan

We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds.

Image to 3D NeRF

Scaling Data Generation in Vision-and-Language Navigation

1 code implementation ICCV 2023 Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents.

Imitation Learning Vision and Language Navigation +1

Boosting Punctuation Restoration with Data Generation and Reinforcement Learning

no code implementations24 Jul 2023 Viet Dac Lai, Abel Salinas, Hao Tan, Trung Bui, Quan Tran, Seunghyun Yoon, Hanieh Deilamsalehy, Franck Dernoncourt, Thien Huu Nguyen

Punctuation restoration is an important task in automatic speech recognition (ASR), which aims to restore the syntactic structure of generated ASR texts to improve readability.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +4

DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents

1 code implementation9 Jun 2023 Fuxiao Liu, Hao Tan, Chris Tensmeyer

In this work, we propose DocumentCLIP, a salience-aware contrastive learning framework to enforce vision-language pretraining models to comprehend the interaction between images and longer text within documents.

Contrastive Learning document understanding

Graph Propagation Transformer for Graph Representation Learning

1 code implementation19 May 2023 Zhe Chen, Hao Tan, Tao Wang, Tianrun Shen, Tong Lu, Qiuying Peng, Cheng Cheng, Yue Qi

The core insight of our method is to fully consider the information propagation among nodes and edges in a graph when building the attention module in the transformer blocks.

Ranked #2 on Graph Regression on PCQM4M-LSC (Validation MAE metric)

Graph Learning Graph Property Prediction +3

Tiny-Attention Adapter: Contexts Are More Important Than the Number of Parameters

no code implementations18 Oct 2022 Hongyu Zhao, Hao Tan, Hongyuan Mei

Our tiny-attention adapter learns to modify the hidden states at each position directly conditioned on the hidden states at all the other positions, which is missed by the previously proposed adapters.

Language Modeling Language Modelling +2
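
A minimal sketch of a tiny-attention adapter: a single attention head with a small inner dimension, inserted as a residual update so that each position's change is conditioned on the hidden states at all other positions. The inner size and scaling are illustrative choices.

```python
import torch
import torch.nn as nn

class TinyAttentionAdapter(nn.Module):
    """Single-head attention adapter with a very small inner dimension.

    Unlike bottleneck MLP adapters, which act per position, each position's
    update here depends on the hidden states at all other positions.
    """
    def __init__(self, dim: int, inner_dim: int = 64):
        super().__init__()
        self.q = nn.Linear(dim, inner_dim)
        self.k = nn.Linear(dim, inner_dim)
        self.v = nn.Linear(dim, inner_dim)
        self.out = nn.Linear(inner_dim, dim)

    def forward(self, h: torch.Tensor) -> torch.Tensor:  # h: (B, T, D)
        q, k, v = self.q(h), self.k(h), self.v(h)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return h + self.out(attn @ v)  # residual update of the hidden states
```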

CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations

1 code implementation Findings (NAACL) 2022 Jialu Li, Hao Tan, Mohit Bansal

Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation.

Navigate Representation Learning +2

EnvEdit: Environment Editing for Vision-and-Language Navigation

1 code implementation CVPR 2022 Jialu Li, Hao Tan, Mohit Bansal

Training on these edit-augmented environments prevents the agent from overfitting to existing environments and helps generalize better to new, unseen environments.

Ranked #2 on Vision and Language Navigation on RxR (using extra training data)

Data Augmentation Diversity +2

How Much Can CLIP Benefit Vision-and-Language Tasks?

4 code implementations13 Jul 2021 Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.

Ranked #4 on Vision and Language Navigation on RxR (using extra training data)

Question Answering Vision and Language Navigation +2

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

1 code implementation NeurIPS 2021 Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal

We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.

Image Retrieval Knowledge Distillation +7

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

1 code implementation21 Jun 2021 Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e.g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations.

Action Classification Action Recognition +2
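
Following that observation, a minimal sketch of block masking for video tokens: whole spatiotemporal blocks are masked instead of individual tokens, so the model cannot trivially copy neighboring tokens. Block shape and count are illustrative.

```python
import torch

def block_mask(t: int, h: int, w: int, num_blocks: int = 4,
               block: tuple = (2, 4, 4)) -> torch.Tensor:
    """Mask contiguous spatiotemporal blocks of a (t, h, w) video token grid."""
    mask = torch.zeros(t, h, w, dtype=torch.bool)
    bt, bh, bw = block
    for _ in range(num_blocks):
        t0 = torch.randint(0, max(t - bt, 1), (1,)).item()
        h0 = torch.randint(0, max(h - bh, 1), (1,)).item()
        w0 = torch.randint(0, max(w - bw, 1), (1,)).item()
        mask[t0:t0 + bt, h0:h0 + bh, w0:w0 + bw] = True  # mask a whole block
    return mask

print(block_mask(4, 8, 8).float().mean())  # fraction of masked tokens
```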

Unifying Vision-and-Language Tasks via Text Generation

2 code implementations4 Feb 2021 Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal

On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models.

Conditional Text Generation Decoder +9

ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments

no code implementations Findings of the Association for Computational Linguistics 2020 Hyounghun Kim, Abhay Zala, Graham Burri, Hao Tan, Mohit Bansal

During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment.

Referring Expression Referring Expression Comprehension +1

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

1 code implementation EMNLP 2020 Hao Tan, Mohit Bansal

We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora.

Image Captioning Language Modeling +1

RelativeNAS: Relative Neural Architecture Search via Slow-Fast Learning

2 code implementations14 Sep 2020 Hao Tan, Ran Cheng, Shihua Huang, Cheng He, Changxiao Qiu, Fan Yang, Ping Luo

Despite the remarkable successes of Convolutional Neural Networks (CNNs) in computer vision, it is time-consuming and error-prone to manually design a CNN.

Keypoint Detection Neural Architecture Search +3

Diagnosing the Environment Bias in Vision-and-Language Navigation

1 code implementation6 May 2020 Yubo Zhang, Hao Tan, Mohit Bansal

Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.

Vision and Language Navigation

The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

1 code implementation EMNLP 2020 Xiang Zhou, Yixin Nie, Hao Tan, Mohit Bansal

For the first question, we conduct a thorough empirical study over analysis sets and find that in addition to the unstable final performance, the instability exists all along the training curve.

Model Selection Natural Language Inference +1

Modality-Balanced Models for Visual Dialogue

no code implementations17 Jan 2020 Hyounghun Kim, Hao Tan, Mohit Bansal

The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.

Visual Dialog

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

9 code implementations IJCNLP 2019 Hao Tan, Mohit Bansal

In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.

Language Modeling Language Modelling +5

Expressing Visual Relationships via Language

1 code implementation ACL 2019 Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, Mohit Bansal

To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.

Decoder Image Captioning +1

Enabling Robots to Understand Incomplete Natural Language Instructions Using Commonsense Reasoning

no code implementations29 Apr 2019 Haonan Chen, Hao Tan, Alan Kuntz, Mohit Bansal, Ron Alterovitz

Our results show the feasibility of a robot learning commonsense knowledge automatically from web-based textual corpora, and the power of learned commonsense reasoning models in enabling a robot to autonomously perform tasks based on incomplete natural language instructions.

Common Sense Reasoning Language Modeling +1

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

1 code implementation NAACL 2019 Hao Tan, Licheng Yu, Mohit Bansal

Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions.

Navigate Reinforcement Learning +2
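
A minimal sketch of that back-translation loop under environmental dropout, assuming a trained speaker model and a feature-dropout augmenter (all names hypothetical).

```python
# A sketch only: `speaker`, `sample_path`, and `env_dropout` are hypothetical
# callables standing in for the trained speaker model, a trajectory sampler,
# and the environmental-dropout augmentation.
def augment_with_back_translation(envs, speaker, sample_path, env_dropout, n_paths=10):
    """Generate (environment, path, instruction) triples for semi-supervised training."""
    new_data = []
    for env in envs:
        dropped_env = env_dropout(env)  # mask subsets of visual features ("new" environment)
        for _ in range(n_paths):
            path = sample_path(dropped_env)            # sample an unlabeled trajectory
            instruction = speaker(dropped_env, path)   # back-translate path -> instruction
            new_data.append((dropped_env, path, instruction))
    return new_data  # extra supervision for the instruction-following agent
```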

Object Ordering with Bidirectional Matchings for Visual Reasoning

no code implementations NAACL 2018 Hao Tan, Mohit Bansal

Visual reasoning with compositional natural language instructions, e.g., based on the newly-released Cornell Natural Language Visual Reasoning (NLVR) dataset, is a challenging task, where the model needs to have the ability to create an accurate mapping between the diverse phrases and the several objects placed in complex arrangements in the image.

Object Visual Reasoning

Source-Target Inference Models for Spatial Instruction Understanding

no code implementations12 Jul 2017 Hao Tan, Mohit Bansal

Models that can execute natural language instructions for situated robotic tasks such as assembly and navigation have several useful applications in homes, offices, and remote scenarios.

Position Position regression +2

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

2 code implementations CVPR 2017 Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.

Referring Expression Referring Expression Comprehension
