no code implementations • 25 Nov 2024 • Yicheng Feng, Yijiang Li, Wanpeng Zhang, Sipeng Zheng, Zongqing Lu
We present VideoOrion, a Video Large Language Model (Video-LLM) that explicitly captures the key semantic information in videos: the spatial-temporal dynamics of objects throughout the video.
no code implementations • 4 Oct 2024 • Ye Wang, Sipeng Zheng, Bin Cao, Qianshan Wei, Qin Jin, Zongqing Lu
Inspired by the recent success of LLMs, the field of human motion understanding has increasingly shifted towards the development of large motion models.
no code implementations • 3 Oct 2024 • Wanpeng Zhang, Zilong Xie, Yicheng Feng, Yijiang Li, Xingrun Xing, Sipeng Zheng, Zongqing Lu
Multimodal Large Language Models have made significant strides in integrating visual and textual information, yet they often struggle with effectively aligning these modalities.
no code implementations • 24 Jun 2024 • Yuting Mei, Ye Wang, Sipeng Zheng, Qin Jin
As robotic agents increasingly assist humans in real-world settings, quadruped robots offer unique opportunities for interaction in complex scenarios due to their agile movement.
1 code implementation • 28 May 2024 • Boshen Xu, Ziheng Wang, Yang Du, Zhinan Song, Sipeng Zheng, Qin Jin
Due to the occurrence of diverse EgoHOIs in the real world, we propose an open-vocabulary benchmark named EgoHOIBench to reveal the diminished performance of current egocentric video-language models (EgoVLMs) on fine-grained concepts, indicating that these models still lack a full spectrum of egocentric understanding.
no code implementations • 14 Mar 2024 • Sipeng Zheng, Bohan Zhou, Yicheng Feng, Ye Wang, Zongqing Lu
In this paper, we propose UniCode, a novel approach within the domain of multimodal large language models (MLLMs) that learns a unified codebook to efficiently tokenize visual, text, and potentially other types of signals.
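As a rough illustration of what a unified codebook can look like, the sketch below implements a generic shared-codebook (vector-quantization) tokenizer in PyTorch. It is not UniCode's actual architecture; the codebook size, feature dimension, and the assumption that all modality encoders project into one shared space are placeholders.

```python
# Illustrative sketch only: a generic shared-codebook quantizer, NOT UniCode's design.
import torch
import torch.nn as nn

class SharedCodebookQuantizer(nn.Module):
    """Maps continuous features from any modality onto the nearest entry
    of a single learned codebook, yielding discrete token ids."""
    def __init__(self, num_codes=8192, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, features):                                   # (batch, seq, dim)
        flat = features.reshape(-1, features.size(-1))             # (N, dim)
        dists = torch.cdist(flat, self.codebook.weight)            # nearest-neighbor lookup
        ids = dists.argmin(dim=-1)                                 # (N,)
        quantized = self.codebook(ids).view_as(features)
        return ids.view(features.shape[:-1]), quantized

# Usage: visual and text encoders (assumed) emit 512-d features, so both
# streams share one discrete vocabulary.
quantizer = SharedCodebookQuantizer()
visual_feats = torch.randn(2, 16, 512)   # e.g. patch embeddings
token_ids, _ = quantizer(visual_feats)
```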
1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin
We introduce SPAFormer, an innovative model designed to overcome the combinatorial explosion challenge in the 3D Part Assembly (3D-PA) task.
1 code implementation • 9 Mar 2024 • Boshen Xu, Sipeng Zheng, Qin Jin
We humans are good at translating third-person observations of hand-object interactions (HOI) into an egocentric view.
no code implementations • 20 Oct 2023 • Sipeng Zheng, Jiazheng Liu, Yicheng Feng, Zongqing Lu
Steve-Eye integrates the LLM with a visual encoder which enables it to process visual-text inputs and generate multimodal feedback.
no code implementations • 13 Oct 2023 • Yicheng Feng, Yuxuan Wang, Jiazheng Liu, Sipeng Zheng, Zongqing Lu
Recently, various studies have leveraged Large Language Models (LLMs) to help decision-making and planning in environments, aligning the LLMs' knowledge with world conditions.
no code implementations • 20 Jul 2023 • Qi Zhang, Sipeng Zheng, Qin Jin
Temporal video grounding (TVG) aims to retrieve the time interval of a language query from an untrimmed video.
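To make the task formulation concrete, here is a minimal sketch of grounding a query by scoring sliding windows of precomputed clip features against a sentence embedding. The pooling, window size, and similarity measure are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of the TVG task: pick the video segment most similar to the query.
import torch
import torch.nn.functional as F

def ground_query(clip_feats, query_feat, window=8, stride=4):
    """clip_feats: (num_clips, dim) per-clip video features.
    query_feat: (dim,) embedding of the language query.
    Returns the best-scoring (start, end) clip indices and its score."""
    best_score, best_span = float("-inf"), (0, window)
    for start in range(0, clip_feats.size(0) - window + 1, stride):
        segment = clip_feats[start:start + window].mean(dim=0)    # pool the window
        score = F.cosine_similarity(segment, query_feat, dim=0).item()
        if score > best_score:
            best_score, best_span = score, (start, start + window)
    return best_span, best_score

span, score = ground_query(torch.randn(64, 256), torch.randn(256))
```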
1 code implementation • 12 Mar 2023 • Ludan Ruan, Anwen Hu, Yuqing Song, Liang Zhang, Sipeng Zheng, Qin Jin
In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing.
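One common way to attach a new modality to a CLIP-style dual encoder is a symmetric contrastive (InfoNCE) loss between paired embeddings; the sketch below shows that pattern with placeholder encoders and dimensions, and is only an assumption about the general recipe, not the released model.

```python
# Illustrative sketch: aligning audio embeddings to CLIP-style visual embeddings
# with a symmetric contrastive loss. Encoder choices and dims are placeholders.
import torch
import torch.nn.functional as F

def contrastive_loss(a_emb, b_emb, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a = F.normalize(a_emb, dim=-1)
    b = F.normalize(b_emb, dim=-1)
    logits = a @ b.t() / temperature                 # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))                # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Audio embeddings from a new audio encoder are pulled toward the (frozen)
# visual embeddings so all modalities share one space.
audio_emb = torch.randn(8, 512, requires_grad=True)
visual_emb = torch.randn(8, 512)
loss = contrastive_loss(audio_emb, visual_emb)
loss.backward()
```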
no code implementations • CVPR 2023 • Sipeng Zheng, Boshen Xu, Qin Jin
Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life.
no code implementations • 10 Aug 2022 • Sipeng Zheng, Qi Zhang, Bei Liu, Qin Jin, Jianlong Fu
In this paper, we provide the technical report of our solution to the Ego4D Natural Language Query Challenge at CVPR 2022.
no code implementations • CVPR 2022 • Sipeng Zheng, ShiZhe Chen, Qin Jin
Most previous works adopt a multi-stage framework for video visual relation detection (VidVRD), which cannot capture long-term spatiotemporal contexts in different stages and also suffers from inefficiency.