Search Results for author: Xinlong Wang

Found 45 papers, 36 papers with code

Instance-Aware Embedding for Point Cloud Instance Segmentation

no code implementations ECCV 2020 Tong He, Yifan Liu, Chunhua Shen, Xinlong Wang, Changming Sun

However, these methods are unaware of the instance context and fail to exploit the boundary and geometric information of an instance, which are critical for separating adjacent objects.

Instance Segmentation Semantic Segmentation

DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception

1 code implementation 11 Jul 2024 Xiaotong Li, Fan Zhang, Haiwen Diao, Yueze Wang, Xinlong Wang, Ling-Yu Duan

To facilitate cutting-edge research on MLLMs for comprehensive visual perception, we propose Perceptual Fusion, which uses a low-budget but highly effective caption engine to produce complete and accurate image descriptions.

Unveiling Encoder-Free Vision-Language Models

1 code implementation 17 Jun 2024 Haiwen Diao, Yufeng Cui, Xiaotong Li, Yueze Wang, Huchuan Lu, Xinlong Wang

Training pure VLMs that accept seamless vision and language inputs, i.e., without vision encoders, remains challenging and rarely explored.

Decoder Inductive Bias +1

Beyond Literal Descriptions: Understanding and Locating Open-World Objects Aligned with Human Intentions

1 code implementation 17 Feb 2024 Wenxuan Wang, Yisi Zhang, Xingjian He, Yichen Yan, Zijia Zhao, Xinlong Wang, Jing Liu

To promote classic VG towards human intention interpretation, we propose a new intention-driven visual grounding (IVG) task and build a large-scale IVG dataset termed IntentionVG with free-form intention expressions.

Visual Grounding

Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

9 code implementations 17 Jan 2024 Lianghui Zhu, Bencheng Liao, Qian Zhang, Xinlong Wang, Wenyu Liu, Xinggang Wang

The results demonstrate that Vim is capable of overcoming the computation and memory constraints of performing Transformer-style understanding on high-resolution images, and that it has great potential to be the next-generation backbone for vision foundation models.

Image Classification object-detection +4
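The bidirectional state-space scan named in the title can be sketched in a few lines. The following is a minimal numpy illustration with fixed scalar parameters `a` and `b` (illustrative assumptions, not the paper's input-dependent selective SSM):

```python
import numpy as np

def linear_scan(x, a=0.9, b=0.1):
    """Minimal linear state-space recurrence h_t = a*h_{t-1} + b*x_t:
    O(L) time and O(1) state per step, instead of attention's O(L^2)."""
    h = np.zeros(x.shape[1:])
    out = np.empty_like(x)
    for t in range(len(x)):
        h = a * h + b * x[t]
        out[t] = h
    return out

def bidirectional_scan(x):
    """Scan the token sequence forward and backward and sum the results,
    so every position receives context from both directions."""
    return linear_scan(x) + linear_scan(x[::-1])[::-1]

tokens = np.random.default_rng(0).standard_normal((6, 4))  # 6 tokens, dim 4
y = bidirectional_scan(tokens)
print(y.shape)  # (6, 4)
```

The linear recurrence is what lets such models sidestep the quadratic cost of self-attention on long (e.g., high-resolution image patch) sequences.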

Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

1 code implementation CVPR 2024 Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu

To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset, and the model UniRES will be publicly available at https://github.com/Rubics-Xuan/MRES.

Descriptive Object +3

Generative Multimodal Models are In-Context Learners

1 code implementation CVPR 2024 Quan Sun, Yufeng Cui, Xiaosong Zhang, Fan Zhang, Qiying Yu, Zhengxiong Luo, Yueze Wang, Yongming Rao, Jingjing Liu, Tiejun Huang, Xinlong Wang

The human ability to easily solve multimodal tasks in context (i.e., with only a few demonstrations or simple instructions) is what current multimodal systems have largely struggled to imitate.

In-Context Learning Personalized Image Generation +3

Tokenize Anything via Prompting

1 code implementation 14 Dec 2023 Ting Pan, Lulu Tang, Xinlong Wang, Shiguang Shan

The semantic token is responsible for learning the semantic priors in a predefined concept space.

Decoder Visual Prompting

Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation

1 code implementation 13 Dec 2023 Wenxuan Wang, Tongtian Yue, Yisi Zhang, Longteng Guo, Xingjian He, Xinlong Wang, Jing Liu

To foster future research into fine-grained visual grounding, our benchmark RefCOCOm, the MRES-32M dataset and model UniRES will be publicly available at https://github.com/Rubics-Xuan/MRES.

Descriptive Object +3

GeoDream: Disentangling 2D and Geometric Priors for High-Fidelity and Consistent 3D Generation

2 code implementations 29 Nov 2023 Baorui Ma, Haoge Deng, Junsheng Zhou, Yu-Shen Liu, Tiejun Huang, Xinlong Wang

We justify that the refined 3D geometric priors aid in the 3D-aware capability of 2D diffusion priors, which in turn provides superior guidance for the refinement of 3D geometric priors.

3D Generation Text to 3D

CapsFusion: Rethinking Image-Text Data at Scale

1 code implementation CVPR 2024 Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao, Xinlong Wang, Jingjing Liu

To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions.

World Knowledge

JudgeLM: Fine-tuned Large Language Models are Scalable Judges

1 code implementation 26 Oct 2023 Lianghui Zhu, Xinggang Wang, Xinlong Wang

To address this problem, we propose to fine-tune LLMs as scalable judges (JudgeLM) to evaluate LLMs efficiently and effectively in open-ended benchmarks.

3D-GPT: Procedural 3D Modeling with Large Language Models

no code implementations 19 Oct 2023 Chunyi Sun, Junlin Han, Weijian Deng, Xinlong Wang, Zishan Qin, Stephen Gould

Our work highlights the potential of LLMs in 3D modeling, offering a basic framework for future advancements in scene generation and animation.

Scene Generation

Uni3D: Exploring Unified 3D Representation at Scale

2 code implementations 10 Oct 2023 Junsheng Zhou, Jinsheng Wang, Baorui Ma, Yu-Shen Liu, Tiejun Huang, Xinlong Wang

Scaling up representations for images or text has been extensively investigated in the past few years and has led to revolutions in learning vision and language.

 Ranked #1 on Zero-shot 3D classification on Objaverse LVIS (using extra training data)

3D Object Classification Retrieval +5

Emu: Generative Pretraining in Multimodality

2 code implementations 11 Jul 2023 Quan Sun, Qiying Yu, Yufeng Cui, Fan Zhang, Xiaosong Zhang, Yueze Wang, Hongcheng Gao, Jingjing Liu, Tiejun Huang, Xinlong Wang

We present Emu, a Transformer-based multimodal foundation model, which can seamlessly generate images and texts in multimodal context.

Image Captioning Temporal/Casual QA +4

Fine-Grained Visual Prompting

1 code implementation NeurIPS 2023 Lingfeng Yang, Yueze Wang, Xiang Li, Xinlong Wang, Jian Yang

Previous works have suggested that incorporating visual prompts, such as colorful boxes or circles, can improve the ability of models to recognize objects of interest.

Visual Prompting

Towards Better Entity Linking with Multi-View Enhanced Distillation

1 code implementation 27 May 2023 Yi Liu, Yuan Tian, Jianxun Lian, Xinlong Wang, Yanan Cao, Fang Fang, Wen Zhang, Haizhen Huang, Denvy Deng, Qi Zhang

Aiming at learning entity representations that can match divergent mentions, this paper proposes a Multi-View Enhanced Distillation (MVD) framework, which can effectively transfer knowledge of multiple fine-grained and mention-relevant parts within entities from cross-encoders to dual-encoders.

Entity Linking Knowledge Distillation +1

Matcher: Segment Anything with One Shot Using All-Purpose Feature Matching

1 code implementation 22 May 2023 Yang Liu, Muzhi Zhu, Hengtao Li, Hao Chen, Xinlong Wang, Chunhua Shen

In this work, we present Matcher, a novel perception paradigm that utilizes off-the-shelf vision foundation models to address various perception tasks.

Few-Shot Semantic Segmentation Segmentation +1

SegGPT: Segmenting Everything In Context

1 code implementation 6 Apr 2023 Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang

We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images.

 Ranked #1 on Few-Shot Semantic Segmentation on PASCAL-5i (5-Shot) (using extra training data)

Few-Shot Semantic Segmentation In-Context Learning +5
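The "same format of images" idea can be illustrated by rendering any label mask as a plain RGB image, so that every kind of segmentation target becomes an image-to-image pair. A minimal numpy sketch (the per-id random coloring is an illustrative assumption, not the paper's exact scheme):

```python
import numpy as np

def mask_to_image(mask, seed=0):
    """Render an integer label mask (H, W) as an RGB image (H, W, 3) by
    assigning each segment id its own color, so any segmentation target
    becomes a plain image that an image-to-image model can predict."""
    rng = np.random.default_rng(seed)
    out = np.zeros(mask.shape + (3,), dtype=np.uint8)
    for i in np.unique(mask):
        out[mask == i] = rng.integers(0, 256, size=3)
    return out

mask = np.array([[0, 0, 1],
                 [0, 2, 2]])
img = mask_to_image(mask)
print(img.shape)  # (2, 3, 3)
```

Once targets are ordinary images, semantic, instance, and few-shot segmentation data can all be fed to one in-context image-prediction model.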

Zero-Shot Video Editing Using Off-The-Shelf Image Diffusion Models

1 code implementation 30 Mar 2023 Wen Wang, Yan Jiang, Kangyang Xie, Zide Liu, Hao Chen, Yue Cao, Xinlong Wang, Chunhua Shen

Our vid2vid-zero leverages off-the-shelf image diffusion models, and doesn't require training on any video.

Image Generation Video Alignment +1

EVA-CLIP: Improved Training Techniques for CLIP at Scale

4 code implementations 27 Mar 2023 Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, Yue Cao

Our approach incorporates new techniques for representation learning, optimization, and augmentation, enabling EVA-CLIP to achieve superior performance compared to previous CLIP models with the same number of parameters but significantly smaller training costs.

Image Classification Representation Learning +2

EVA-02: A Visual Representation for Neon Genesis

6 code implementations 20 Mar 2023 Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xinlong Wang, Yue Cao

We launch EVA-02, a next-generation Transformer-based visual representation pre-trained to reconstruct strong and robust language-aligned vision features via masked image modeling.

Affective Image Filter: Reflecting Emotions from Text to Images

no code implementations ICCV 2023 Shuchen Weng, Peixuan Zhang, Zheng Chang, Xinlong Wang, Si Li, Boxin Shi

In this work, we propose Affective Image Filter (AIF), a novel model that is able to understand the visually-abstract emotions in the text and reflect them in visually-concrete images with appropriate colors and textures.

Image Generation

SegGPT: Towards Segmenting Everything in Context

no code implementations ICCV 2023 Xinlong Wang, Xiaosong Zhang, Yue Cao, Wen Wang, Chunhua Shen, Tiejun Huang

We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images.

Few-Shot Semantic Segmentation In-Context Learning +4

Images Speak in Images: A Generalist Painter for In-Context Visual Learning

1 code implementation CVPR 2023 Xinlong Wang, Wen Wang, Yue Cao, Chunhua Shen, Tiejun Huang

In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images and to specify task prompts also as images.

In-Context Learning Keypoint Detection +2

FreeSOLO: Learning to Segment Objects without Annotations

1 code implementation CVPR 2022 Xinlong Wang, Zhiding Yu, Shalini De Mello, Jan Kautz, Anima Anandkumar, Chunhua Shen, Jose M. Alvarez

FreeSOLO further demonstrates superiority as a strong pre-training method, outperforming state-of-the-art self-supervised pre-training methods by +9.8% AP when fine-tuning instance segmentation with only 5% COCO masks.

Instance Segmentation object-detection +4

SOLO: A Simple Framework for Instance Segmentation

no code implementations 30 Jun 2021 Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, Lei Li

Besides instance segmentation, our method yields state-of-the-art results in object detection (from our mask byproduct) and panoptic segmentation.

Image Matting Instance Segmentation +4

TFPose: Direct Human Pose Estimation with Transformers

no code implementations 29 Mar 2021 Weian Mao, Yongtao Ge, Chunhua Shen, Zhi Tian, Xinlong Wang, Zhibin Wang

We propose a human pose estimation framework that solves the task in a regression-based fashion.

Ranked #26 on Pose Estimation on MPII Human Pose (using extra training data)

Pose Estimation regression

End-to-End Video Instance Segmentation with Transformers

2 code implementations CVPR 2021 Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, Huaxia Xia

Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem.

Instance Segmentation Segmentation +3

Dense Contrastive Learning for Self-Supervised Visual Pre-Training

6 code implementations CVPR 2021 Xinlong Wang, Rufeng Zhang, Chunhua Shen, Tao Kong, Lei Li

Compared to the baseline method MoCo-v2, our method introduces negligible computation overhead (only <1% slower), but demonstrates consistently superior performance when transferring to downstream dense prediction tasks including object detection, semantic segmentation and instance segmentation; and outperforms the state-of-the-art methods by a large margin.

Contrastive Learning Image Classification +7

SOLOv2: Dynamic and Fast Instance Segmentation

18 code implementations NeurIPS 2020 Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, Chunhua Shen

Importantly, we take one step further by dynamically learning the mask head of the object segmenter such that the mask head is conditioned on the location.

object-detection Object Detection +4
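The location-conditioned mask head described above can be sketched as dynamic convolution: a kernel is predicted per instance and applied to shared mask features. A minimal numpy illustration (1x1 kernels only; the function names and shapes are simplified assumptions, not the released implementation):

```python
import numpy as np

def dynamic_mask_head(mask_feats, kernels):
    """Apply per-instance predicted kernels as 1x1 convolutions.

    mask_feats: (C, H, W) shared mask features.
    kernels:    (N, C) one predicted 1x1 conv kernel per instance,
                conditioned on where that instance was detected.
    Returns:    (N, H, W) one soft mask per instance.
    """
    C, H, W = mask_feats.shape
    flat = mask_feats.reshape(C, H * W)        # (C, H*W)
    logits = kernels @ flat                    # (N, H*W)
    masks = 1.0 / (1.0 + np.exp(-logits))      # sigmoid to soft masks
    return masks.reshape(-1, H, W)

rng = np.random.default_rng(0)
feats = rng.standard_normal((8, 16, 16))
kernels = rng.standard_normal((3, 8))          # 3 detected instances
masks = dynamic_mask_head(feats, kernels)
print(masks.shape)  # (3, 16, 16)
```

Because the kernels, not the features, vary per instance, the mask branch is computed once and each instance costs only a tiny convolution, which is what makes the head fast.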

DiverseDepth: Affine-invariant Depth Prediction Using Diverse Data

2 code implementations 3 Feb 2020 Wei Yin, Xinlong Wang, Chunhua Shen, Yifan Liu, Zhi Tian, Songcen Xu, Changming Sun, Dou Renyin

Compared with previous learning objectives, i.e., learning metric depth or relative depth, we propose to learn affine-invariant depth using our diverse dataset to ensure both generalization and high-quality geometric shapes of scenes.

Depth Estimation Depth Prediction
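One common way to make a depth objective affine-invariant is to normalize each depth map by a per-image shift and scale before comparing. The sketch below (numpy; the median and mean-absolute-deviation statistics are illustrative choices, not necessarily the paper's exact normalization) gives a loss that is unchanged when the prediction differs from ground truth by d' = a*d + b:

```python
import numpy as np

def affine_invariant(depth, eps=1e-6):
    """Normalize a depth map to be invariant to per-image shift and scale:
    subtract the median (shift), divide by the mean absolute deviation
    (scale). Two maps related by d' = a*d + b then compare as equal."""
    t = np.median(depth)
    s = np.mean(np.abs(depth - t)) + eps
    return (depth - t) / s

def affine_invariant_loss(pred, gt):
    return np.mean(np.abs(affine_invariant(pred) - affine_invariant(gt)))

rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 10.0, size=(4, 4))
pred = 3.0 * gt + 2.0   # differs from gt only by an affine map
print(affine_invariant_loss(pred, gt))  # ~0: the loss ignores scale/shift
```

This invariance is what allows training on diverse datasets whose depth annotations have inconsistent scales.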

Task-Aware Monocular Depth Estimation for 3D Object Detection

1 code implementation 17 Sep 2019 Xinlong Wang, Wei Yin, Tao Kong, Yuning Jiang, Lei Li, Chunhua Shen

In this paper, we first analyse the data distributions and interaction of foreground and background, then propose the foreground-background separated monocular depth estimation (ForeSeE) method, to estimate the foreground depth and background depth using separate optimization objectives and depth decoders.

3D Object Detection 3D Object Recognition +4
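The separate-objectives idea can be sketched as two masked losses, one per decoder: the foreground prediction is supervised only on foreground pixels and the background prediction only on the rest. A minimal numpy sketch (L1 losses and the function name are illustrative assumptions):

```python
import numpy as np

def foreground_background_loss(pred_fg, pred_bg, gt, fg_mask):
    """Supervise two depth decoders with separate objectives: the
    foreground decoder only on foreground pixels, the background
    decoder only on the remaining pixels (L1 for simplicity)."""
    fg = fg_mask.astype(bool)
    loss_fg = np.abs(pred_fg - gt)[fg].mean() if fg.any() else 0.0
    loss_bg = np.abs(pred_bg - gt)[~fg].mean() if (~fg).any() else 0.0
    return loss_fg + loss_bg

gt = np.array([[1.0, 2.0], [3.0, 4.0]])
fg_mask = np.array([[1, 0], [0, 1]])
# Each decoder is wrong outside its own region but right inside it,
# so the combined loss is still zero.
pred_fg = np.array([[1.0, 9.0], [9.0, 4.0]])
pred_bg = np.array([[9.0, 2.0], [3.0, 9.0]])
print(foreground_background_loss(pred_fg, pred_bg, gt, fg_mask))  # 0.0
```

Separating the objectives keeps the abundant background pixels from dominating the optimization of foreground (object) depth, which is what 3D detection actually consumes.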

Associatively Segmenting Instances and Semantics in Point Clouds

3 code implementations CVPR 2019 Xinlong Wang, Shu Liu, Xiaoyong Shen, Chunhua Shen, Jiaya Jia

A 3D point cloud describes the real scene precisely and intuitively. To date, how to segment diversified elements in such an informative 3D scene is rarely discussed.

Ranked #15 on 3D Instance Segmentation on S3DIS (mRec metric)

3D Instance Segmentation 3D Semantic Segmentation +1

Repulsion Loss: Detecting Pedestrians in a Crowd

2 code implementations CVPR 2018 Xinlong Wang, Tete Xiao, Yuning Jiang, Shuai Shao, Jian Sun, Chunhua Shen

In this paper, we first explore how a state-of-the-art pedestrian detector is harmed by crowd occlusion via experimentation, providing insights into the crowd occlusion problem.

Ranked #9 on Pedestrian Detection on Caltech (using extra training data)

Pedestrian Detection regression
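The repulsion idea behind the title can be sketched as a penalty that grows as a predicted box overlaps ground-truth boxes other than its own target, measured by intersection-over-ground-truth (IoG). A simplified numpy sketch (the -log(1 - IoG) form here is a plain simplification of the paper's smoothed variant):

```python
import numpy as np

def iog(box, gt):
    """Intersection area divided by the ground-truth box's area."""
    ix = max(0.0, min(box[2], gt[2]) - max(box[0], gt[0]))
    iy = max(0.0, min(box[3], gt[3]) - max(box[1], gt[1]))
    gt_area = (gt[2] - gt[0]) * (gt[3] - gt[1])
    return ix * iy / gt_area

def repulsion_term(pred_box, other_gts):
    """Penalty that grows as a prediction overlaps non-target ground
    truths, pushing boxes apart in crowded scenes."""
    total = 0.0
    for gt in other_gts:
        total += -np.log(max(1.0 - iog(pred_box, gt), 1e-6))
    return total

pred = [0.0, 0.0, 2.0, 2.0]
neighbor = [1.0, 0.0, 3.0, 2.0]   # overlapping non-target pedestrian
far = [5.0, 5.0, 6.0, 6.0]        # non-overlapping one
print(repulsion_term(pred, [far]))           # 0.0, no overlap, no penalty
print(repulsion_term(pred, [neighbor]) > 0)  # True
```

Adding such a term to the usual attraction loss discourages predictions from drifting onto neighboring pedestrians, the failure mode crowd occlusion induces.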

Adversarial Generation of Training Examples: Applications to Moving Vehicle License Plate Recognition

no code implementations 11 Jul 2017 Xinlong Wang, Zhipeng Man, Mingyu You, Chunhua Shen

Our experimental results on a few datasets demonstrate the effectiveness of using GAN images: an improvement of 7.5% over a strong baseline when only moderate-sized real data is available.

Image Generation License Plate Recognition
