Search Results for author: Hao Tan

Found 39 papers, 22 papers with code

SOHES: Self-supervised Open-world Hierarchical Entity Segmentation

no code implementations • 18 Apr 2024 • Shengcao Cao, Jiuxiang Gu, Jason Kuen, Hao Tan, Ruiyi Zhang, Handong Zhao, Ani Nenkova, Liang-Yan Gui, Tong Sun, Yu-Xiong Wang

Using raw images as the sole training data, our method achieves unprecedented performance in self-supervised open-world segmentation, marking a significant milestone towards high-quality open-world entity segmentation in the absence of human-annotated masks.

Paper
Add Code

MeshLRM: Large Reconstruction Model for High-Quality Mesh

no code implementations • 18 Apr 2024 • Xinyue Wei, Kai Zhang, Sai Bi, Hao Tan, Fujun Luan, Valentin Deschaintre, Kalyan Sunkavalli, Hao Su, Zexiang Xu

This allows for end-to-end mesh reconstruction by fine-tuning a pre-trained NeRF LRM with mesh rendering.

Paper
Add Code

PVLR: Prompt-driven Visual-Linguistic Representation Learning for Multi-Label Image Recognition

no code implementations • 31 Jan 2024 • Hao Tan, Zichang Tan, Jun Li, Jun Wan, Zhen Lei

In contrast to the unidirectional fusion in previous works, we introduce a Dual-Modal Attention (DMA) that enables bidirectional interaction between textual and visual features, yielding context-aware label representations and semantic-related visual representations, which are subsequently used to calculate similarities and generate final predictions for all labels.

Representation Learning

Paper
Add Code

Template-Free Single-View 3D Human Digitalization with Diffusion-Guided LRM

no code implementations • 22 Jan 2024 • Zhenzhen Weng, Jingyuan Liu, Hao Tan, Zhan Xu, Yang Zhou, Serena Yeung-Levy, Jimei Yang

We present Human-LRM, a diffusion-guided feed-forward model that predicts the implicit field of a human from a single image.

Paper
Add Code

Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning

no code implementations • 21 Dec 2023 • Desai Xie, Jiahao Li, Hao Tan, Xin Sun, Zhixin Shu, Yi Zhou, Sai Bi, Sören Pirk, Arie E. Kaufman

To this end, we introduce Carve3D, an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric, to enhance the consistency of multi-view diffusion models.

Language Modelling Large Language Model +1

Paper
Add Code

Compound Text-Guided Prompt Tuning via Image-Adaptive Cues

1 code implementation • 11 Dec 2023 • Hao Tan, Jun Li, Yizhuang Zhou, Jun Wan, Zhen Lei, Xiangyu Zhang

We introduce text supervision to the optimization of prompts, which enables two benefits: 1) releasing the model reliance on the pre-defined category names during inference, thereby enabling more flexible prompt generation; 2) reducing the number of inputs to the text encoder, which decreases GPU memory consumption significantly.

Domain Generalization

Paper
Code

PF-LRM: Pose-Free Large Reconstruction Model for Joint Pose and Shape Prediction

no code implementations • 20 Nov 2023 • Peng Wang, Hao Tan, Sai Bi, Yinghao Xu, Fujun Luan, Kalyan Sunkavalli, Wenping Wang, Zexiang Xu, Kai Zhang

We propose a Pose-Free Large Reconstruction Model (PF-LRM) for reconstructing a 3D object from a few unposed images even with little visual overlap, while simultaneously estimating the relative camera poses in ~1. 3 seconds on a single A100 GPU.

3D Reconstruction Image to 3D +1

Paper
Add Code

DMV3D: Denoising Multi-View Diffusion using 3D Large Reconstruction Model

no code implementations • 15 Nov 2023 • Yinghao Xu, Hao Tan, Fujun Luan, Sai Bi, Peng Wang, Jiahao Li, Zifan Shi, Kalyan Sunkavalli, Gordon Wetzstein, Zexiang Xu, Kai Zhang

We propose \textbf{DMV3D}, a novel 3D generation approach that uses a transformer-based 3D large reconstruction model to denoise multi-view diffusion.

3D Generation Denoising +2

Paper
Add Code

Federated Skewed Label Learning with Logits Fusion

no code implementations • 14 Nov 2023 • Yuwei Wang, Runhan Li, Hao Tan, Xuefeng Jiang, Sheng Sun, Min Liu, Bo Gao, Zhiyuan Wu

By fusing the logits of the two models, the private weak learner can capture the variance of different data, regardless of their category.

Federated Learning

Paper
Add Code

Instant3D: Fast Text-to-3D with Sparse-View Generation and Large Reconstruction Model

no code implementations • 10 Nov 2023 • Jiahao Li, Hao Tan, Kai Zhang, Zexiang Xu, Fujun Luan, Yinghao Xu, Yicong Hong, Kalyan Sunkavalli, Greg Shakhnarovich, Sai Bi

Text-to-3D with diffusion models has achieved remarkable progress in recent years.

Text to 3D

Paper
Add Code

LRM: Large Reconstruction Model for Single Image to 3D

1 code implementation • 8 Nov 2023 • Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan

We propose the first Large Reconstruction Model (LRM) that predicts the 3D model of an object from a single input image within just 5 seconds.

Image to 3D

718

Paper
Code

Scaling Data Generation in Vision-and-Language Navigation

1 code implementation • ICCV 2023 • Zun Wang, Jialu Li, Yicong Hong, Yi Wang, Qi Wu, Mohit Bansal, Stephen Gould, Hao Tan, Yu Qiao

Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents.

Imitation Learning Vision and Language Navigation +1

135

Paper
Code

Boosting Punctuation Restoration with Data Generation and Reinforcement Learning

no code implementations • 24 Jul 2023 • Viet Dac Lai, Abel Salinas, Hao Tan, Trung Bui, Quan Tran, Seunghyun Yoon, Hanieh Deilamsalehy, Franck Dernoncourt, Thien Huu Nguyen

Punctuation restoration is an important task in automatic speech recognition (ASR) which aim to restore the syntactic structure of generated ASR texts to improve readability.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +3

Paper
Add Code

Learning Navigational Visual Representations with Semantic Map Supervision

1 code implementation • ICCV 2023 • Yicong Hong, Yang Zhou, Ruiyi Zhang, Franck Dernoncourt, Trung Bui, Stephen Gould, Hao Tan

Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot.

Representation Learning Self-Supervised Learning +2

Paper
Code

DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents

1 code implementation • 9 Jun 2023 • Fuxiao Liu, Hao Tan, Chris Tensmeyer

In this work, we propose DocumentCLIP, a salience-aware contrastive learning framework to enforce vision-language pretraining models to comprehend the interaction between images and longer text within documents.

Contrastive Learning document understanding

Paper
Code

Graph Propagation Transformer for Graph Representation Learning

1 code implementation • 19 May 2023 • Zhe Chen, Hao Tan, Tao Wang, Tianrun Shen, Tong Lu, Qiuying Peng, Cheng Cheng, Yue Qi

The core insight of our method is to fully consider the information propagation among nodes and edges in a graph when building the attention module in the transformer blocks.

Ranked #2 on Graph Regression on PCQM4M-LSC (Validation MAE metric)

Graph Learning Graph Property Prediction +3

Paper
Code

Tiny-Attention Adapter: Contexts Are More Important Than the Number of Parameters

no code implementations • 18 Oct 2022 • Hongyu Zhao, Hao Tan, Hongyuan Mei

Our tiny-attention adapter learns to modify the hidden states at each position directly conditioned on the hidden states at all the other positions, which is missed by the previously proposed adapters.

Language Modelling Transfer Learning

Paper
Add Code

Adversarial Attacks on ASR Systems: An Overview

no code implementations • 3 Aug 2022 • Xiao Zhang, Hao Tan, Xuan Huang, Denghui Zhang, Keke Tang, Zhaoquan Gu

With the development of hardware and algorithms, ASR(Automatic Speech Recognition) systems evolve a lot.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Paper
Add Code

CLEAR: Improving Vision-Language Navigation with Cross-Lingual, Environment-Agnostic Representations

1 code implementation • Findings (NAACL) 2022 • Jialu Li, Hao Tan, Mohit Bansal

Empirically, on the Room-Across-Room dataset, we show that our multilingual agent gets large improvements in all metrics over the strong baseline model when generalizing to unseen environments with the cross-lingual language representation and the environment-agnostic visual representation.

Navigate Representation Learning +2

Paper
Code

EnvEdit: Environment Editing for Vision-and-Language Navigation

1 code implementation • CVPR 2022 • Jialu Li, Hao Tan, Mohit Bansal

Training on these edit-augmented environments prevents the agent from overfitting to existing environments and helps generalize better to new, unseen environments.

Ranked #2 on Vision and Language Navigation on RxR (using extra training data)

Data Augmentation Navigate +1

Paper
Code

How Much Can CLIP Benefit Vision-and-Language Tasks?

4 code implementations • 13 Jul 2021 • Sheng Shen, Liunian Harold Li, Hao Tan, Mohit Bansal, Anna Rohrbach, Kai-Wei Chang, Zhewei Yao, Kurt Keutzer

Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using a relatively small set of manually-annotated data (as compared to web-crawled data), to perceive the visual world.

Ranked #4 on Vision and Language Navigation on RxR (using extra training data)

Question Answering Vision and Language Navigation +2

381

Paper
Code

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

1 code implementation • NeurIPS 2021 • Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal

We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset.

Image Retrieval Knowledge Distillation +6

Paper
Code

VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning

1 code implementation • 21 Jun 2021 • Hao Tan, Jie Lei, Thomas Wolf, Mohit Bansal

Unlike language, where the text tokens are more independent, neighboring video tokens typically have strong correlations (e. g., consecutive video frames usually look very similar), and hence uniformly masking individual tokens will make the task too trivial to learn useful representations.

Ranked #10 on Action Recognition on Diving-48

Action Classification Action Recognition +2

Paper
Code

Improving Cross-Modal Alignment in Vision Language Navigation via Syntactic Information

1 code implementation • NAACL 2021 • Jialu Li, Hao Tan, Mohit Bansal

One key challenge in this task is to ground instructions with the current visual information that the agent perceives.

Navigate Sentence +1

Paper
Code

Unifying Vision-and-Language Tasks via Text Generation

2 code implementations • 4 Feb 2021 • Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal

On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models.

Ranked #3 on Image Captioning on nocaps val

Conditional Text Generation Image Captioning +7

350

Paper
Code

ArraMon: A Joint Navigation-Assembly Instruction Interpretation Task in Dynamic Environments

no code implementations • Findings of the Association for Computational Linguistics 2020 • Hyounghun Kim, Abhay Zala, Graham Burri, Hao Tan, Mohit Bansal

During this task, the agent (similar to a PokeMON GO player) is asked to find and collect different target objects one-by-one by navigating based on natural language instructions in a complex, realistic outdoor environment, but then also ARRAnge the collected objects part-by-part in an egocentric grid-layout environment.

Referring Expression Referring Expression Comprehension +1

Paper
Add Code

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision

1 code implementation • EMNLP 2020 • Hao Tan, Mohit Bansal

We find that the main reason hindering this exploration is the large divergence in magnitude and distributions between the visually-grounded language datasets and pure-language corpora.

Image Captioning Language Modelling

186

Paper
Code

MAF: Multimodal Alignment Framework for Weakly-Supervised Phrase Grounding

1 code implementation • EMNLP 2020 • Qinxin Wang, Hao Tan, Sheng Shen, Michael W. Mahoney, Zhewei Yao

Phrase localization is a task that studies the mapping from textual phrases to regions of an image.

Phrase Grounding

Paper
Code

RelativeNAS: Relative Neural Architecture Search via Slow-Fast Learning

2 code implementations • 14 Sep 2020 • Hao Tan, Ran Cheng, Shihua Huang, Cheng He, Changxiao Qiu, Fan Yang, Ping Luo

Despite the remarkable successes of Convolutional Neural Networks (CNNs) in computer vision, it is time-consuming and error-prone to manually design a CNN.

Keypoint Detection Neural Architecture Search +3

Paper
Code

Diagnosing the Environment Bias in Vision-and-Language Navigation

1 code implementation • 6 May 2020 • Yubo Zhang, Hao Tan, Mohit Bansal

Vision-and-Language Navigation (VLN) requires an agent to follow natural-language instructions, explore the given environments, and reach the desired target locations.

Vision and Language Navigation

Paper
Code

The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

1 code implementation • EMNLP 2020 • Xiang Zhou, Yixin Nie, Hao Tan, Mohit Bansal

For the first question, we conduct a thorough empirical study over analysis sets and find that in addition to the unstable final performance, the instability exists all along the training curve.

Model Selection Natural Language Inference +1

Paper
Code

Modality-Balanced Models for Visual Dialogue

no code implementations • 17 Jan 2020 • Hyounghun Kim, Hao Tan, Mohit Bansal

The Visual Dialog task requires a model to exploit both image and conversational context information to generate the next response to the dialogue.

Visual Dialog

Paper
Add Code

LXMERT: Learning Cross-Modality Encoder Representations from Transformers

9 code implementations • IJCNLP 2019 • Hao Tan, Mohit Bansal

In LXMERT, we build a large-scale Transformer model that consists of three encoders: an object relationship encoder, a language encoder, and a cross-modality encoder.

Ranked #1 on Visual Question Answering (VQA) on VizWiz 2018

Language Modelling Masked Language Modeling +4

124,527

Paper
Code

Expressing Visual Relationships via Language

1 code implementation • ACL 2019 • Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, Mohit Bansal

To push forward the research in this direction, we first introduce a new language-guided image editing dataset that contains a large number of real image pairs with corresponding editing instructions.

Image Captioning Retrieval

Paper
Code

Enabling Robots to Understand Incomplete Natural Language Instructions Using Commonsense Reasoning

no code implementations • 29 Apr 2019 • Haonan Chen, Hao Tan, Alan Kuntz, Mohit Bansal, Ron Alterovitz

Our results show the feasibility of a robot learning commonsense knowledge automatically from web-based textual corpora, and the power of learned commonsense reasoning models in enabling a robot to autonomously perform tasks based on incomplete natural language instructions.

Common Sense Reasoning Language Modelling

Paper
Add Code

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout

1 code implementation • NAACL 2019 • Hao Tan, Licheng Yu, Mohit Bansal

Next, we apply semi-supervised learning (via back-translation) on these dropped-out environments to generate new paths and instructions.

Ranked #1 on Vision-Language Navigation on Room2Room

Navigate Translation +1

120

Paper
Code

Object Ordering with Bidirectional Matchings for Visual Reasoning

no code implementations • NAACL 2018 • Hao Tan, Mohit Bansal

Visual reasoning with compositional natural language instructions, e. g., based on the newly-released Cornell Natural Language Visual Reasoning (NLVR) dataset, is a challenging task, where the model needs to have the ability to create an accurate mapping between the diverse phrases and the several objects placed in complex arrangements in the image.

Object Visual Reasoning

Paper
Add Code

Source-Target Inference Models for Spatial Instruction Understanding

no code implementations • 12 Jul 2017 • Hao Tan, Mohit Bansal

Models that can execute natural language instructions for situated robotic tasks such as assembly and navigation have several useful applications in homes, offices, and remote scenarios.

Position Position regression +2

Paper
Add Code

A Joint Speaker-Listener-Reinforcer Model for Referring Expressions

2 code implementations • CVPR 2017 • Licheng Yu, Hao Tan, Mohit Bansal, Tamara L. Berg

The speaker generates referring expressions, the listener comprehends referring expressions, and the reinforcer introduces a reward function to guide sampling of more discriminative expressions.

Referring Expression Referring Expression Comprehension

Paper
Code

Cannot find the paper you are looking for? You can Submit a new open access paper.