Search Results for author: Weihan Wang

Found 16 papers, 8 papers with code

MotionBench: Benchmarking and Improving Fine-grained Video Motion Understanding for Vision Language Models

no code implementations • 6 Jan 2025 • Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, Jie Tang

To address this gap, we propose MotionBench, a comprehensive evaluation benchmark designed to assess the fine-grained motion comprehension of video understanding models.

Benchmarking • Feature Compression +1

MathGLM-Vision: Solving Mathematical Problems with Multi-Modal Large Language Model

no code implementations • 10 Sep 2024 • Zhen Yang, Jinhao Chen, Zhengxiao Du, Wenmeng Yu, Weihan Wang, Wenyi Hong, Zhihuan Jiang, Bin Xu, Jie Tang

Large language models (LLMs) have demonstrated significant capabilities in mathematical reasoning, particularly with text-based mathematical problems.

Diversity • Language Modeling +3

VisScience: An Extensive Benchmark for Evaluating K12 Educational Multi-modal Scientific Reasoning

no code implementations • 10 Sep 2024 • Zhihuan Jiang, Zhen Yang, Jinhao Chen, Zhengxiao Du, Weihan Wang, Bin Xu, Jie Tang

To address this gap, we meticulously construct a comprehensive benchmark, named VisScience, to assess multi-modal scientific reasoning across three disciplines: mathematics, physics, and chemistry.

Question Answering • Visual Question Answering

CogVLM2: Visual Language Models for Image and Video Understanding

3 code implementations • 29 Aug 2024 • Wenyi Hong, Weihan Wang, Ming Ding, Wenmeng Yu, Qingsong Lv, Yan Wang, Yean Cheng, Shiyu Huang, Junhui Ji, Zhao Xue, Lei Zhao, Zhuoyi Yang, Xiaotao Gu, Xiaohan Zhang, Guanyu Feng, Da Yin, Zihan Wang, Ji Qi, Xixuan Song, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Yuxiao Dong, Jie Tang

Beginning with VisualGLM and CogVLM, we are continuously exploring VLMs in pursuit of enhanced vision-language fusion, efficient higher-resolution architecture, and broader modalities and applications.

MM-Vet • MVBench +3

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

1 code implementation • 12 Aug 2024 • Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, Da Yin, Xiaotao Gu, Yuxuan Zhang, Weihan Wang, Yean Cheng, Ting Liu, Bin Xu, Yuxiao Dong, Jie Tang

We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels (a hedged usage sketch follows the tags below).

Text-to-Video Generation • Video Alignment +2
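
Since CogVideoX ships with open-source code, a minimal usage sketch is possible. The snippet below is an assumption-laden illustration, not the authors' reference script: it presumes the Hugging Face `diffusers` integration and the `THUDM/CogVideoX-5b` checkpoint name, and all sampling settings are placeholders.

```python
# Hedged sketch: text-to-video generation with CogVideoX via `diffusers`.
# Checkpoint name and sampling settings are assumptions, not from this page.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",        # assumed checkpoint name
    torch_dtype=torch.bfloat16,
).to("cuda")

frames = pipe(
    prompt="A panda playing guitar by a river at sunset",
    num_inference_steps=50,
    guidance_scale=6.0,
    num_frames=49,               # early checkpoints generate ~6 s clips
).frames[0]

# The abstract above cites 16 fps for the latest model; early releases export at 8 fps.
export_to_video(frames, "output.mp4", fps=8)
```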

VIPeR: Visual Incremental Place Recognition with Adaptive Mining and Lifelong Learning

no code implementations • 31 Jul 2024 • Yuhang Ming, Minyang Xu, Xingrui Yang, Weicai Ye, Weihan Wang, Yong Peng, Weichen Dai, Wanzeng Kong

Then, to prevent catastrophic forgetting in lifelong learning, we draw inspiration from human memory systems and design a novel memory bank for our VIPeR (a toy sketch of such a bank follows the tags below).

Knowledge Distillation • Visual Place Recognition
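
The memory bank mentioned above is described only at a high level here, so the following is a toy sketch rather than the VIPeR implementation: a fixed-capacity bank of place descriptors maintained by reservoir sampling, with random replay to mitigate catastrophic forgetting. The class name, sizes, and sampling scheme are all assumptions.

```python
# Toy memory bank (hypothetical, not the VIPeR code): stores past place
# descriptors and replays a random subset during lifelong training.
import numpy as np

class MemoryBank:
    def __init__(self, capacity=1024, dim=256, seed=0):
        self.capacity, self.dim = capacity, dim
        self.buffer = np.zeros((capacity, dim), dtype=np.float32)
        self.count = 0                      # total descriptors seen so far
        self.rng = np.random.default_rng(seed)

    def add(self, descriptor):
        # Reservoir sampling keeps the bank an unbiased sample of the stream.
        if self.count < self.capacity:
            self.buffer[self.count] = descriptor
        else:
            j = self.rng.integers(0, self.count + 1)
            if j < self.capacity:
                self.buffer[j] = descriptor
        self.count += 1

    def replay(self, batch_size=32):
        # Return a random batch of stored descriptors for rehearsal.
        n = min(self.count, self.capacity)
        if n == 0:
            return self.buffer[:0]
        idx = self.rng.choice(n, size=min(batch_size, n), replace=False)
        return self.buffer[idx]
```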

CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

1 code implementation • 6 Feb 2024 • Ji Qi, Ming Ding, Weihan Wang, Yushi Bai, Qingsong Lv, Wenyi Hong, Bin Xu, Lei Hou, Juanzi Li, Yuxiao Dong, Jie Tang

Drawing inspiration from human cognition in solving visual problems (e.g., marking, zooming in), this paper introduces Chain of Manipulations, a mechanism that enables VLMs to solve problems step-by-step with evidence (an illustrative loop is sketched below).

Visual Reasoning
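
The chain-of-manipulations idea lends itself to a short illustration. The loop below is a hypothetical stand-in, not the released CogCoM code: `vlm_step` and the manipulation names are invented for the sketch, whereas the real system learns these operations end-to-end.

```python
# Illustrative chain-of-manipulations loop (hypothetical, not CogCoM's code):
# the model repeatedly requests an image operation, receives the result as
# new visual evidence, and stops once it can answer.
from PIL import Image

def crop_and_zoom(image, box):
    """Crop a region of interest and upsample it for a closer look."""
    region = image.crop(box)
    return region.resize((region.width * 2, region.height * 2))

MANIPULATIONS = {"crop_and_zoom": crop_and_zoom}

def chain_of_manipulations(image, question, vlm_step, max_steps=5):
    evidence = [image]
    for _ in range(max_steps):
        # vlm_step (assumed interface) returns either a final answer or the
        # next manipulation to apply, e.g.
        # {"type": "manipulate", "name": "crop_and_zoom", "args": [(0, 0, 64, 64)]}
        action = vlm_step(question, evidence)
        if action["type"] == "answer":
            return action["text"]
        op = MANIPULATIONS[action["name"]]
        evidence.append(op(evidence[-1], *action["args"]))
    return None  # step budget exhausted without a confident answer
```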

PlanarNeRF: Online Learning of Planar Primitives with Neural Radiance Fields

no code implementations • 30 Dec 2023 • Zheng Chen, Qingan Yan, Huangying Zhan, Changjiang Cai, Xiangyu Xu, Yuzhong Huang, Weihan Wang, Ziyue Feng, Lantao Liu, Yi Xu

Through extensive experiments, we demonstrate the effectiveness of PlanarNeRF in various scenarios and its remarkable improvement over existing works.

3D Plane Detection

CogAgent: A Visual Language Model for GUI Agents

3 code implementations • CVPR 2024 • Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxuan Zhang, Juanzi Li, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang

People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs), e.g., computer or smartphone screens.


Language Modeling +5

ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation

no code implementations • ICCV 2023 • Weihan Wang, Zhen Yang, Bin Xu, Juanzi Li, Yankui Sun

Vision-language pre-training (VLP) methods have been blossoming recently, and their crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of vision-language tasks.

Image-text matching • Language Modeling +3

EDI: ESKF-based Disjoint Initialization for Visual-Inertial SLAM Systems

no code implementations • 4 Aug 2023 • Weihan Wang, Jiani Li, Yuhang Ming, Philippos Mordohai

Our method incorporates an Error-state Kalman Filter (ESKF) to estimate gyroscope bias and correct rotation estimates from monocular SLAM, overcoming dependence on pure monocular SLAM for rotation estimation.
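
The abstract describes estimating gyroscope bias with an ESKF by comparing gyro-integrated rotation against the rotation reported by monocular SLAM. Below is a minimal sketch of that idea under simplifying assumptions (small rotation vectors over each interval, bias modeled as a random walk); the class name and all noise values are placeholders, not the authors' parameters.

```python
# Minimal gyro-bias ESKF sketch (illustrative, not the EDI implementation).
# Measurement model: over an interval dt, the gyro-integrated rotation
# exceeds the SLAM rotation by roughly bias * dt.
import numpy as np

class GyroBiasESKF:
    def __init__(self):
        self.b = np.zeros(3)          # estimated gyroscope bias (rad/s)
        self.P = np.eye(3) * 1e-2     # bias covariance
        self.Q = np.eye(3) * 1e-6     # random-walk process noise (placeholder)
        self.R = np.eye(3) * 1e-4     # SLAM rotation noise (placeholder)

    def update(self, dtheta_gyro, dtheta_slam, dt):
        """dtheta_*: small rotation vectors (rad) accumulated over dt seconds."""
        self.P += self.Q * dt                          # predict: bias random walk
        y = (dtheta_gyro - dtheta_slam) - self.b * dt  # innovation
        H = np.eye(3) * dt                             # Jacobian w.r.t. bias
        S = H @ self.P @ H.T + self.R
        K = self.P @ H.T @ np.linalg.inv(S)
        self.b += K @ y                                # correct the bias estimate
        self.P = (np.eye(3) - K @ H) @ self.P
        return self.b
```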

Real-Time Dense 3D Mapping of Underwater Environments

2 code implementations • 5 Apr 2023 • Weihan Wang, Bharat Joshi, Nathaniel Burgdorfer, Konstantinos Batsos, Alberto Quattrini Li, Philippos Mordohai, Ioannis Rekleitis

To address this problem, we propose to use SVIn2, a robust VIO method, together with a real-time 3D reconstruction pipeline.

3D Reconstruction
