Search Results for author: Yuying Ge

Found 22 papers, 16 papers with code

Supervised Fine-tuning in turn Improves Visual Foundation Models

1 code implementation • 18 Jan 2024 • Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan

Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years.

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

1 code implementation • 14 Dec 2023 • Jinguo Zhu, Xiaohan Ding, Yixiao Ge, Yuying Ge, Sijie Zhao, Hengshuang Zhao, Xiaohua Wang, Ying Shan

In combination with the existing text tokenizer and detokenizer, this framework allows for the encoding of interleaved image-text data into a multimodal sequence, which can subsequently be fed into the transformer model.

Image Captioning • In-Context Learning • +4
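
For illustration only, the snippet below sketches the interleaving described above: alternating text and image segments are flattened into one token sequence for left-to-right autoregressive modelling. The special-token ids, vocabulary offset, and helper name are hypothetical and not taken from the VL-GPT tokenizer.

```python
# Hypothetical sketch: flatten interleaved image/text segments into one token
# sequence for an autoregressive transformer. Token-id ranges and special
# tokens below are assumptions, not the actual VL-GPT implementation.

TEXT_VOCAB_SIZE = 32000          # assumed text vocabulary size
BOI, EOI = 32000, 32001          # assumed begin/end-of-image special ids
IMAGE_TOKEN_OFFSET = 32002       # image codes are shifted past the text vocab

def interleave(segments):
    """segments: list of ("text", ids) or ("image", codes) in document order."""
    sequence = []
    for kind, ids in segments:
        if kind == "text":
            sequence.extend(ids)                        # text ids pass through
        else:  # "image"
            sequence.append(BOI)                        # mark image start
            sequence.extend(IMAGE_TOKEN_OFFSET + c for c in ids)
            sequence.append(EOI)                        # mark image end
    return sequence

# Example: text, then discrete codes from a visual tokenizer, then more text
example = [
    ("text", [101, 2023, 1997]),
    ("image", [5, 17, 233, 8]),
    ("text", [1998, 2049, 14408]),
]
print(interleave(example))
```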

EgoPlan-Bench: Benchmarking Egocentric Embodied Planning with Multimodal Large Language Models

1 code implementation • 11 Dec 2023 • Yi Chen, Yuying Ge, Yixiao Ge, Mingyu Ding, Bohao Li, Rui Wang, Ruifeng Xu, Ying Shan, Xihui Liu

Given diverse environmental inputs, including real-time task progress, visual observations, and open-form language instructions, a proficient task planner is expected to predict feasible actions, which is a feat inherently achievable by Multimodal Large Language Models (MLLMs).

Benchmarking • Human-Object Interaction Detection

SEED-Bench-2: Benchmarking Multimodal Large Language Models

1 code implementation • 28 Nov 2023 • Bohao Li, Yuying Ge, Yixiao Ge, Guangzhi Wang, Rui Wang, Ruimao Zhang, Ying Shan

Multimodal large language models (MLLMs), building upon the foundation of powerful large language models (LLMs), have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3).

Benchmarking • Image Generation • +1

ViT-Lens: Towards Omni-modal Representations

1 code implementation • 27 Nov 2023 • Weixian Lei, Yixiao Ge, Kun Yi, Jianfeng Zhang, Difei Gao, Dylan Sun, Yuying Ge, Ying Shan, Mike Zheng Shou

In this paper, we present ViT-Lens-2 that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space.

EEG • Image Generation • +2

Making LLaMA SEE and Draw with SEED Tokenizer

1 code implementation • 2 Oct 2023 • Yuying Ge, Sijie Zhao, Ziyun Zeng, Yixiao Ge, Chen Li, Xintao Wang, Ying Shan

We identify two crucial design principles: (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs.

Multimodal Generation
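
To make the "1D causal dependency" principle above concrete, here is a rough, purely illustrative sketch: learned query tokens attend to 2D patch features, a causal mask among the queries ensures token i depends only on earlier tokens, and a nearest-codebook lookup yields discrete ids. The module structure, sizes, and quantizer are assumptions, not the SEED tokenizer itself.

```python
import torch
import torch.nn as nn

class CausalQueryTokenizer(nn.Module):
    """Illustrative only: turn 2D patch features into a 1D, causally ordered
    sequence of discrete image tokens (all sizes are assumptions)."""

    def __init__(self, num_queries=32, dim=768, codebook_size=8192):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_feats):                       # (B, N_patches, dim)
        B = patch_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        n = q.size(1)
        # causal mask: query i may only attend to queries <= i
        mask = torch.triu(torch.ones(n, n), diagonal=1).bool()
        q, _ = self.self_attn(q, q, q, attn_mask=mask)
        # gather visual content from the (position-agnostic) patch features
        q, _ = self.cross_attn(q, patch_feats, patch_feats)
        # nearest-codebook-entry lookup -> discrete token ids
        flat = q.reshape(-1, q.size(-1))                  # (B * num_queries, dim)
        dists = torch.cdist(flat, self.codebook.weight)   # distances to codebook
        return dists.argmin(dim=-1).reshape(B, n)         # (B, num_queries) ids

tokens = CausalQueryTokenizer()(torch.randn(2, 196, 768))
print(tokens.shape)   # torch.Size([2, 32])
```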

GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

1 code implementation • 31 Aug 2023 • Yanjie Ze, Ge Yan, Yueh-Hua Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, Xiaolong Wang

To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model (e.g., Stable Diffusion) to distill rich semantic information into the deep 3D voxel.

Decision Making

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension

2 code implementations • 30 Jul 2023 • Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, Ying Shan

Based on powerful Large Language Models (LLMs), recent generative Multimodal Large Language Models (MLLMs) have gained prominence as a pivotal research area, exhibiting remarkable capability for both comprehension and generation.

Benchmarking • Multiple-choice

Planting a SEED of Vision in Large Language Model

1 code implementation • 16 Jul 2023 • Yuying Ge, Yixiao Ge, Ziyun Zeng, Xintao Wang, Ying Shan

Research on image tokenizers has previously reached an impasse, as frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.).

Language Modelling • Large Language Model • +1

Align, Adapt and Inject: Sound-guided Unified Image Generation

no code implementations • 20 Jun 2023 • Yue Yang, Kaipeng Zhang, Yuying Ge, Wenqi Shao, Zeyue Xue, Yu Qiao, Ping Luo

Then, we propose the audio adapter to adapt audio representation into an audio token enriched with specific semantics, which can be injected into a frozen T2I model flexibly.

Image Generation • Retrieval • +1
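
A minimal sketch of the "adapt and inject" idea in the sentence above: a small adapter maps a pooled audio embedding into the text-token space of a frozen text-to-image model, and the resulting audio token is concatenated into the conditioning sequence. The dimensions, module names, and pooling assumption are hypothetical, not the paper's implementation.

```python
import torch
import torch.nn as nn

class AudioAdapter(nn.Module):
    """Illustrative adapter: project a pooled audio embedding into the
    text-token space of a frozen T2I model (sizes are assumptions)."""

    def __init__(self, audio_dim=512, text_dim=768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, text_dim),
            nn.GELU(),
            nn.Linear(text_dim, text_dim),
        )

    def forward(self, audio_embed):                 # (B, audio_dim)
        return self.proj(audio_embed).unsqueeze(1)  # (B, 1, text_dim) audio token

def inject(text_tokens, audio_token):
    """Concatenate the adapted audio token into the frozen text-conditioning
    sequence; only the adapter would be trained."""
    return torch.cat([text_tokens, audio_token], dim=1)

text_tokens = torch.randn(2, 77, 768)               # frozen text-encoder output
audio_token = AudioAdapter()(torch.randn(2, 512))   # pooled audio features
print(inject(text_tokens, audio_token).shape)       # torch.Size([2, 78, 768])
```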

Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models

no code implementations • 15 Jun 2023 • Junting Pan, Ziyi Lin, Yuying Ge, Xiatian Zhu, Renrui Zhang, Yi Wang, Yu Qiao, Hongsheng Li

Video Question Answering (VideoQA) has been significantly advanced from the scaling of recent Large Language Models (LLMs).

Ranked #3 on Temporal/Causal QA on NExT-QA (using extra training data)

Domain Generalization • Retrieval • +2

Policy Adaptation from Foundation Model Feedback

no code implementations • CVPR 2023 • Yuying Ge, Annabella Macaluso, Li Erran Li, Ping Luo, Xiaolong Wang

When deploying the trained policy to a new task or a new environment, we first let the policy play with randomly generated instructions to record the demonstrations.

Decision Making

Learning Transferable Spatiotemporal Representations from Natural Script Knowledge

1 code implementation • CVPR 2023 • Ziyun Zeng, Yuying Ge, Xihui Liu, Bin Chen, Ping Luo, Shu-Tao Xia, Yixiao Ge

Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years.

Descriptive Representation Learning +1

MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval

1 code implementation • 26 Apr 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Alex Jinpeng Wang, Jianping Wu, Ying Shan, XiaoHu Qie, Ping Luo

Dominant pre-training work for video-text retrieval mainly adopts "dual-encoder" architectures to enable efficient retrieval, where two separate encoders contrast global video and text representations but ignore detailed local semantics.

Action Recognition • Retrieval • +6
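
For context on the "dual-encoder" phrasing above, here is a minimal sketch of the symmetric contrastive (InfoNCE) objective such architectures typically optimise over global video and text embeddings. The function name, embedding size, and temperature are assumptions, not details of MILES.

```python
import torch
import torch.nn.functional as F

def dual_encoder_contrastive_loss(video_embed, text_embed, temperature=0.07):
    """Symmetric InfoNCE over global embeddings from two separate encoders;
    matched video-text pairs share the same row index in the batch."""
    v = F.normalize(video_embed, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    logits = v @ t.t() / temperature                 # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random stand-ins for encoder outputs
loss = dual_encoder_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```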

All in One: Exploring Unified Video-Language Pre-training

1 code implementation • CVPR 2023 • Alex Jinpeng Wang, Yixiao Ge, Rui Yan, Yuying Ge, Xudong Lin, Guanyu Cai, Jianping Wu, Ying Shan, XiaoHu Qie, Mike Zheng Shou

In this work, we introduce, for the first time, an end-to-end video-language model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture.

Ranked #6 on TGIF-Transition on TGIF-QA (using extra training data)

Language Modelling • Multiple-choice • +10

Bridging Video-text Retrieval with Multiple Choice Questions

2 code implementations • CVPR 2022 • Yuying Ge, Yixiao Ge, Xihui Liu, Dian Li, Ying Shan, XiaoHu Qie, Ping Luo

As an additional benefit, our method achieves competitive results with much shorter pre-training videos on single-modality downstream tasks, e.g., action recognition with linear evaluation.

Action Recognition • Multiple-choice • +8

MetaDance: Few-shot Dancing Video Retargeting via Temporal-aware Meta-learning

no code implementations • 13 Jan 2022 • Yuying Ge, Yibing Song, Ruimao Zhang, Ping Luo

Dancing video retargeting aims to synthesize a video that transfers the dance movements from a source video to a target person.

Meta-Learning

MetaCloth: Learning Unseen Tasks of Dense Fashion Landmark Detection from a Few Samples

no code implementations • 6 Dec 2021 • Yuying Ge, Ruimao Zhang, Ping Luo

This work proposes a novel framework named MetaCloth via meta-learning, which is able to learn unseen tasks of dense fashion landmark detection with only a few annotated samples.

Meta-Learning

Disentangled Cycle Consistency for Highly-realistic Virtual Try-On

1 code implementation • CVPR 2021 • Chongjian Ge, Yibing Song, Yuying Ge, Han Yang, Wei Liu, Ping Luo

To this end, DCTON can be naturally trained in a self-supervised manner following cycle consistency learning.

Virtual Try-on

Parser-Free Virtual Try-on via Distilling Appearance Flows

2 code implementations • CVPR 2021 • Yuying Ge, Yibing Song, Ruimao Zhang, Chongjian Ge, Wei Liu, Ping Luo

A recent pioneering work employed knowledge distillation to reduce the dependency on human parsing, where the try-on images produced by a parser-based method are used as supervision to train a "student" network without relying on segmentation, making the student mimic the try-on ability of the parser-based model.

Human Parsing • Knowledge Distillation • +1
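
The distillation setup described above can be caricatured with the toy loop below: a frozen "teacher" (standing in for the parser-based model) produces try-on images that supervise a parser-free "student". Every module, shape, and loss here is a placeholder chosen for illustration, not the paper's networks or training recipe.

```python
import torch
import torch.nn as nn

# Placeholder networks: the real teacher needs human parsing at training time,
# the student does not; both are stand-ins for illustration only.
teacher = nn.Conv2d(6, 3, 3, padding=1)          # frozen "parser-based" try-on model
student = nn.Conv2d(6, 3, 3, padding=1)          # "parser-free" student to be trained
teacher.requires_grad_(False)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
criterion = nn.L1Loss()                          # assumed reconstruction-style loss

for step in range(3):                            # toy loop over random batches
    person = torch.randn(2, 3, 256, 192)
    garment = torch.randn(2, 3, 256, 192)
    inputs = torch.cat([person, garment], dim=1)
    with torch.no_grad():
        pseudo_target = teacher(inputs)          # teacher try-on image as supervision
    prediction = student(inputs)                 # student never sees parsing maps
    loss = criterion(prediction, pseudo_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print(f"step {step}: distillation loss = {loss.item():.4f}")
```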

SCAN: Self-and-Collaborative Attention Network for Video Person Re-identification

no code implementations • 16 Jul 2018 • Ruimao Zhang, Hongbin Sun, Jingyu Li, Yuying Ge, Liang Lin, Ping Luo, Xiaogang Wang

To address the above issues, we present a novel and practical deep architecture for video person re-identification termed Self-and-Collaborative Attention Network (SCAN).

Video-Based Person Re-Identification
