Search Results for author: Sipeng Zheng

Found 15 papers, 4 papers with code

VideoOrion: Tokenizing Object Dynamics in Videos

no code implementations • 25 Nov 2024 • Yicheng Feng, Yijiang Li, Wanpeng Zhang, Sipeng Zheng, Zongqing Lu

We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the video.

Language Modeling • Language Modelling +4
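
The abstract suggests representing each object's trajectory as dedicated tokens fed to the LLM. As a rough illustration only (the paper's actual architecture is not shown on this page), the sketch below pools per-object feature tracklets into "object tokens" in the LLM's embedding space; the class name, dimensions, and mean-pooling scheme are all hypothetical.

```python
import torch
import torch.nn as nn

class ObjectTokenizer(nn.Module):
    """Pools per-object feature tracklets into a fixed set of 'object tokens'."""
    def __init__(self, feat_dim=1024, llm_dim=4096, tokens_per_object=1):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim * tokens_per_object)
        self.tokens_per_object = tokens_per_object
        self.llm_dim = llm_dim

    def forward(self, tracklets):
        # tracklets: list of (T_i, feat_dim) tensors, one per detected object
        tokens = []
        for track in tracklets:
            pooled = track.mean(dim=0)  # temporal average over the tracklet
            tok = self.proj(pooled).view(self.tokens_per_object, self.llm_dim)
            tokens.append(tok)
        return torch.cat(tokens, dim=0)  # (num_objects * tokens_per_object, llm_dim)

# Two hypothetical object tracklets of different lengths:
tracks = [torch.randn(16, 1024), torch.randn(9, 1024)]
obj_tokens = ObjectTokenizer()(tracks)
print(obj_tokens.shape)  # torch.Size([2, 4096])
```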

Quo Vadis, Motion Generation? From Large Language Models to Large Motion Models

no code implementations • 4 Oct 2024 • Ye Wang, Sipeng Zheng, Bin Cao, Qianshan Wei, Qin Jin, Zongqing Lu

Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted towards the development of large motion models.

Motion Generation

From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities

no code implementations • 3 Oct 2024 • Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu

Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities.
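
The title indicates byte-pair encoding applied to quantized visual tokens. Below is a minimal, generic BPE merge loop over sequences of discrete token ids, purely illustrative of the underlying algorithm; the paper's actual tokenizer, merge policy, and vocabulary handling may differ.

```python
from collections import Counter

def bpe_merge(sequences, num_merges, next_id):
    """Greedy byte-pair encoding over sequences of discrete (visual) token ids."""
    merges = {}
    for _ in range(num_merges):
        # Count every adjacent pair across all sequences
        pairs = Counter()
        for seq in sequences:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]   # most frequent adjacent pair
        merges[best] = next_id
        merged = []
        for seq in sequences:
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(next_id)     # replace the pair with a new token id
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            merged.append(out)
        sequences = merged
        next_id += 1
    return sequences, merges

# Quantized "image" rows as token ids from a hypothetical codebook of size 8:
seqs = [[3, 5, 3, 5, 1, 3, 5], [3, 5, 2, 3, 5, 3, 5]]
compressed, merges = bpe_merge(seqs, num_merges=2, next_id=8)
print(compressed, merges)
```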

QuadrupedGPT: Towards a Versatile Quadruped Agent in Open-ended Worlds

no code implementations • 24 Jun 2024 • Yuting Mei, Ye Wang, Sipeng Zheng, Qin Jin

As robotic agents increasingly assist humans in the real world, quadruped robots offer unique opportunities for interaction in complex scenarios due to their agile movement.

Decision Making • Navigate

EgoNCE++: Do Egocentric Video-Language Models Really Understand Hand-Object Interactions?

1 code implementation • 28 May 2024 • Boshen Xu, Ziheng Wang, Yang Du, Zhinan Song, Sipeng Zheng, Qin Jin

Because diverse EgoHOIs occur in the real world, we propose an open-vocabulary benchmark named EgoHOIBench to reveal the diminished performance of current egocentric video-language models (EgoVLMs) on fine-grained concepts, indicating that these models still lack a full spectrum of egocentric understanding.

Action Recognition • Attribute +2
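
EgoNCE++ builds on contrastive video-language pretraining. The paper's specific positive/negative construction for hand-object interactions is not reproduced here; the sketch below shows only the generic symmetric InfoNCE objective such methods typically start from.

```python
import torch
import torch.nn.functional as F

def info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired video/text embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature        # (B, B) similarity matrix
    targets = torch.arange(len(v))        # i-th video matches i-th caption
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

loss = info_nce(torch.randn(8, 512), torch.randn(8, 512))
print(loss.item())
```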

UniCode: Learning a Unified Codebook for Multimodal Large Language Models

no code implementations • 14 Mar 2024 • Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu

In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals.

Quantization • Visual Question Answering (VQA)
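
A unified codebook implies a vector-quantization step that maps features from any modality to shared discrete ids. The following is a minimal nearest-neighbor VQ sketch under that assumption; the codebook size, dimensions, and lookup are hypothetical, not UniCode's actual design.

```python
import torch

def quantize(features, codebook):
    """Maps continuous features to their nearest codebook entries (VQ step)."""
    # features: (N, D), codebook: (K, D)
    dists = torch.cdist(features, codebook)  # (N, K) pairwise L2 distances
    ids = dists.argmin(dim=1)                # index of nearest code per feature
    return ids, codebook[ids]

codebook = torch.randn(1024, 256)       # hypothetical shared codebook
visual_feats = torch.randn(64, 256)     # e.g., patch features from a vision encoder
ids, quantized = quantize(visual_feats, codebook)
print(ids.shape, quantized.shape)       # torch.Size([64]) torch.Size([64, 256])
```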

SPAFormer: Sequential 3D Part Assembly with Transformers

1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin

We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task.

POV: Prompt-Oriented View-Agnostic Learning for Egocentric Hand-Object Interaction in the Multi-View World

1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin

We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view.

Steve-Eye: Equipping LLM-based Embodied Agents with Visual Perception in Open Worlds

no code implementations • 20 Oct 2023 • Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu

Steve-Eye integrates an LLM with a visual encoder, enabling it to process visual-text inputs and generate multimodal feedback.
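
A common way to wire a visual encoder into an LLM is to project patch features into the LLM's embedding space and prepend them to the text tokens. The sketch below illustrates that generic pattern; it is an assumption for illustration, not Steve-Eye's actual integration.

```python
import torch
import torch.nn as nn

class VisualPrefix(nn.Module):
    """Projects visual-encoder patch features into the LLM embedding space."""
    def __init__(self, vis_dim=1024, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats, text_embeds):
        # patch_feats: (B, P, vis_dim) from a frozen visual encoder
        # text_embeds: (B, T, llm_dim) from the LLM's token embedding table
        prefix = self.proj(patch_feats)
        return torch.cat([prefix, text_embeds], dim=1)  # (B, P+T, llm_dim)

fused = VisualPrefix()(torch.randn(2, 256, 1024), torch.randn(2, 32, 4096))
print(fused.shape)  # torch.Size([2, 288, 4096])
```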

LLaMA Rider: Spurring Large Language Models to Explore the Open World

no code implementations • 13 Oct 2023 • Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu

Recently, various studies have leveraged Large Language Models (LLMs) to help with decision-making and planning in environments, and have tried to align the LLMs' knowledge with world conditions.

Decision Making • Minecraft +1

No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection

no code implementations • 20 Jul 2023 • Qi Zhang, Sipeng Zheng, Qin Jin

Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.

Boundary Detection • Video Grounding
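
To make the TVG task concrete: given per-clip features and a query embedding, a grounder scores candidate intervals and returns the best one. The sliding-window scorer below is a deliberately simple illustration of the task interface, not the paper's multi-scale neighboring attention or zoom-in boundary detection.

```python
import torch
import torch.nn.functional as F

def ground_query(clip_feats, query_feat, window_sizes=(2, 4, 8)):
    """Scores sliding windows of clips against a query and returns the best interval."""
    best = (float("-inf"), (0, 0))
    q = F.normalize(query_feat, dim=-1)
    for w in window_sizes:
        for start in range(len(clip_feats) - w + 1):
            # Average the clips in the window, then compare to the query
            seg = F.normalize(clip_feats[start:start + w].mean(dim=0), dim=-1)
            score = (seg @ q).item()
            if score > best[0]:
                best = (score, (start, start + w))
    return best  # (similarity, (start_clip, end_clip))

score, (s, e) = ground_query(torch.randn(32, 512), torch.randn(512))
print(score, s, e)
```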

Accommodating Audio Modality in CLIP for Multimodal Processing

1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin

In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.

AudioCaps • Contrastive Learning +5
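
Extending CLIP to audio typically means training an audio branch whose outputs live in CLIP's joint embedding space. The sketch below shows that generic pattern with a hypothetical projection head; the paper's actual audio encoder and training recipe are not reproduced here.

```python
import torch
import torch.nn as nn

class AudioHead(nn.Module):
    """Maps audio-encoder features into CLIP's joint embedding space."""
    def __init__(self, audio_dim=768, clip_dim=512):
        super().__init__()
        self.proj = nn.Linear(audio_dim, clip_dim)

    def forward(self, audio_feats):
        emb = self.proj(audio_feats)
        return emb / emb.norm(dim=-1, keepdim=True)  # unit-norm, like CLIP embeddings

audio_emb = AudioHead()(torch.randn(4, 768))  # hypothetical audio features
text_emb = torch.randn(4, 512)                # stand-in for frozen CLIP text output
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
sims = audio_emb @ text_emb.T                 # audio-text similarity matrix
print(sims.shape)  # torch.Size([4, 4])
```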

Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework

no code implementations • CVPR 2023 • Sipeng Zheng, Boshen Xu, Qin Jin

Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life.

Human-Object Interaction Detection • Language Modeling +2

Exploring Anchor-based Detection for Ego4D Natural Language Query

no code implementations • 10 Aug 2022 • Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu

In this paper we provide the technical report for the Ego4D natural language query challenge at CVPR 2022.

Video Understanding
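
Anchor-based detection scores a fixed set of candidate temporal intervals against the query. The helper below enumerates generic multi-scale anchors over a clip sequence; the scales and stride ratio are illustrative assumptions, not the report's configuration.

```python
def temporal_anchors(num_clips, scales=(4, 8, 16), stride_ratio=0.5):
    """Enumerates multi-scale temporal anchors (start, end) over a clip sequence."""
    anchors = []
    for w in scales:
        stride = max(1, int(w * stride_ratio))  # overlap anchors at each scale
        for start in range(0, num_clips - w + 1, stride):
            anchors.append((start, start + w))
    return anchors

print(temporal_anchors(32)[:5])  # [(0, 4), (2, 6), (4, 8), (6, 10), (8, 12)]
```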

VRDFormer: End-to-End Video Visual Relation Detection With Transformers

no code implementations • CVPR 2022 • Sipeng Zheng, ShiZhe Chen, Qin Jin

Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal context across stages and also suffers from inefficiency.

Object • Relation +3
