no code implementations • 4 Feb 2025 • Chia-Wen Kuo, Sijie Zhu, Fan Chen, Xiaohui Shen, Longyin Wen
Large vision-and-language models (LVLMs) typically treat visual and textual embeddings as homogeneous inputs to a large language model (LLM).
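The homogeneous treatment described above can be sketched minimally: visual features are projected into the LLM's embedding space and concatenated with text token embeddings into one sequence. All dimensions and the linear projector below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Toy sketch (not the paper's method): visual patch features are linearly
# projected into the LLM embedding space and concatenated with text token
# embeddings, so the LLM consumes one homogeneous sequence.
rng = np.random.default_rng(0)

d_vision, d_llm = 1024, 4096      # hypothetical feature sizes
n_patches, n_tokens = 256, 32     # image patches and text tokens

patch_feats = rng.standard_normal((n_patches, d_vision))
token_embeds = rng.standard_normal((n_tokens, d_llm))

# Learned projector (here a random matrix standing in for an MLP).
W_proj = rng.standard_normal((d_vision, d_llm)) / np.sqrt(d_vision)
visual_embeds = patch_feats @ W_proj

# Homogeneous input sequence: [visual tokens ; text tokens]
llm_input = np.concatenate([visual_embeds, token_embeds], axis=0)
print(llm_input.shape)  # (288, 4096)
```

In a real model the projector is trained and the sequence is interleaved with prompt text, but the key point is that the LLM sees a single embedding sequence.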
no code implementations • 6 Nov 2024 • Xin Gu, Ming Li, Libo Zhang, Fan Chen, Longyin Wen, Tiejian Luo, Sijie Zhu
1) We first design a quantitative metric system based on a best-in-class LVLM (Large Vision-Language Model), i.e., GPT-4o in our case, to evaluate the generation quality from three perspectives, namely, instruction following, detail preserving, and generation quality.
no code implementations • 5 Nov 2024 • Ying Zhou, Xinyao Wang, Yulei Niu, Yaojie Shen, Lexin Tang, Fan Chen, Ben He, Le Sun, Longyin Wen
Recent advancements in large language models (LLMs) have significantly enhanced their knowledge and generative capabilities, leading to a surge of interest in leveraging LLMs for high-quality data synthesis.
1 code implementation • 13 Sep 2024 • Yaojie Shen, Xinyao Wang, Yulei Niu, Ying Zhou, Lexin Tang, Libo Zhang, Fan Chen, Longyin Wen
Despite its success, our study shows that the length exploitation issue present in PO is even more severe in Iterative Preference Optimization (IPO) due to the iterative nature of the process.
1 code implementation • 15 Jun 2024 • Lu Xu, Sijie Zhu, Chunyuan Li, Chia-Wen Kuo, Fan Chen, Xinyao Wang, Guang Chen, Dawei Du, Ye Yuan, Longyin Wen
However, a large portion of videos in real-world applications are edited videos, e.g., users usually cut and add effects/modifications to the raw video before publishing it on social media platforms.
1 code implementation • 9 May 2024 • Jiachen Li, Xinyao Wang, Sijie Zhu, Chia-Wen Kuo, Lu Xu, Fan Chen, Jitesh Jain, Humphrey Shi, Longyin Wen
Recent advancements in Multimodal Large Language Models (LLMs) have focused primarily on scaling by increasing text-image pair data and enhancing LLMs to improve performance on multimodal tasks.
Ranked #1 on visual instruction following on LLaVA-Bench
no code implementations • 24 Mar 2024 • Xin Gu, Libo Zhang, Fan Chen, Longyin Wen, YuFei Wang, Tiejian Luo, Sijie Zhu
Each video in our dataset is rendered by various image/video materials with a single editing component, which supports atomic visual understanding of different editing components.
1 code implementation • ICCV 2023 • Yaojie Shen, Xin Gu, Kai Xu, Heng Fan, Longyin Wen, Libo Zhang
Addressing this, we study video captioning from a different perspective in the compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient at inference, as smaller and less redundant information is processed.
Ranked #8 on Video Captioning on VATEX
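The compressed-video signals named above (I-frames, motion vectors, residuals) can be illustrated with a toy reconstruction, assuming a single global motion vector; this is a sketch of the codec's frame structure, not the captioning model itself.

```python
import numpy as np

# Illustrative sketch: a P-frame is reconstructed by motion-compensating
# the reference (I-)frame and adding the typically sparse residual.
i_frame = np.arange(16, dtype=float).reshape(4, 4)   # toy reference frame
motion = (1, 0)                                       # global shift: 1 px down
predicted = np.roll(i_frame, shift=motion, axis=(0, 1))

residual = np.zeros_like(i_frame)
residual[2, 2] = 5.0                                  # small correction term

p_frame = predicted + residual
print(p_frame[2, 2])  # 11.0 (motion-compensated value 6.0 plus residual 5.0)
```

Real codecs use per-block motion vectors and transform-coded residuals, but the motion vectors and residuals are exactly the compact, pre-decoded signals the abstract refers to.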
no code implementations • 21 Jun 2023 • YuHan Shen, Linjie Yang, Longyin Wen, Haichao Yu, Ehsan Elhamifar, Heng Wang
Recent focus in video captioning has been on designing architectures that can consume both video and text modalities, and using large-scale video datasets with text transcripts for pre-training, such as HowTo100M.
Automatic Speech Recognition (ASR)
no code implementations • CVPR 2023 • Xin Gu, Guang Chen, YuFei Wang, Libo Zhang, Tiejian Luo, Longyin Wen
Meanwhile, the internal stream is designed to exploit the multi-modality information in videos (e.g., the appearance of video frames, speech transcripts, and video captions) to ensure the quality of caption results.
Ranked #7 on Video Captioning on YouCook2
1 code implementation • 6 Mar 2023 • Wei Li, Linchao Zhu, Longyin Wen, Yi Yang
This decoder is both data-efficient and computation-efficient: 1) it only requires the text data for training, easing the burden on the collection of paired data.
1 code implementation • 7 Jul 2022 • Xin Gu, Hanhua Ye, Guang Chen, YuFei Wang, Libo Zhang, Longyin Wen
This paper describes our champion solution for the CVPR2022 Generic Event Boundary Captioning (GEBC) competition.
1 code implementation • 25 Jun 2022 • Dexiang Hong, Xiaoqi Ma, Xinyao Wang, CongCong Li, YuFei Wang, Longyin Wen
This report presents the algorithm used in the submission of Generic Event Boundary Detection (GEBD) Challenge at CVPR 2022.
no code implementations • 7 Jun 2022 • CongCong Li, Xinyao Wang, Dexiang Hong, YuFei Wang, Libo Zhang, Tiejian Luo, Longyin Wen
To capture temporal context information of each frame, we design the structure context transformer (SC-Transformer) by re-partitioning input frame sequence.
no code implementations • CVPR 2022 • CongCong Li, Xinyao Wang, Longyin Wen, Dexiang Hong, Tiejian Luo, Libo Zhang
Generic event boundary detection aims to localize the generic, taxonomy-free event boundaries that segment videos into chunks.
1 code implementation • ICCV 2021 • Boying Wang, Libo Zhang, Longyin Wen, Xianglong Liu, Yanjun Wu
Towards real-world prohibited item detection, we collect a large-scale dataset, named PIDray, which covers various cases in real-world scenarios for prohibited item detection, especially deliberately hidden items.
1 code implementation • 19 Jul 2021 • Dawei Du, Longyin Wen, Pengfei Zhu, Heng Fan, QinGhua Hu, Haibin Ling, Mubarak Shah, Junwen Pan, Ali Al-Ali, Amr Mohamed, Bakour Imene, Bin Dong, Binyu Zhang, Bouchali Hadia Nesma, Chenfeng Xu, Chenzhen Duan, Ciro Castiello, Corrado Mencar, Dingkang Liang, Florian Krüger, Gennaro Vessio, Giovanna Castellano, Jieru Wang, Junyu Gao, Khalid Abualsaud, Laihui Ding, Lei Zhao, Marco Cianciotta, Muhammad Saqib, Noor Almaadeed, Omar Elharrouss, Pei Lyu, Qi Wang, Shidong Liu, Shuang Qiu, Siyang Pan, Somaya Al-Maadeed, Sultan Daud Khan, Tamer Khattab, Tao Han, Thomas Golda, Wei Xu, Xiang Bai, Xiaoqing Xu, Xuelong Li, Yanyun Zhao, Ye Tian, Yingnan Lin, Yongchao Xu, Yuehan Yao, Zhenyu Xu, Zhijian Zhao, Zhipeng Luo, Zhiwei Wei, Zhiyuan Zhao
Crowd counting on the drone platform is an interesting topic in computer vision, which brings new challenges such as small object inference, background clutter and wide viewpoint.
1 code implementation • 1 Jul 2021 • Dexiang Hong, CongCong Li, Longyin Wen, Xinyao Wang, Libo Zhang
In this work, we design a Cascaded Temporal Attention Network (CASTANET) for GEBD, which is formed by three parts, the backbone network, the temporal attention module, and the classification module.
Ranked #1 on Boundary Detection on Kinetics-400
1 code implementation • CVPR 2021 • Longyin Wen, Dawei Du, Pengfei Zhu, QinGhua Hu, Qilong Wang, Liefeng Bo, Siwei Lyu
To promote the development of object detection, tracking and counting algorithms in drone-captured videos, we construct a benchmark with a new drone-captured large-scale dataset, named DroneCrowd, formed by 112 video clips with 33,600 HD frames in various scenarios.
no code implementations • 27 May 2020 • Guang Chen, Shiwen Shen, Longyin Wen, Si Luo, Liefeng Bo
Existing methods focus only on pig counting from a single image, and their accuracy is challenged by several factors, including pig movements, occlusion and overlapping.
no code implementations • ECCV 2020 • Cong-Cong Li, Dawei Du, Libo Zhang, Longyin Wen, Tiejian Luo, Yanjun Wu, Pengfei Zhu
Specifically, we first build the spatial pyramid representation to capture context information of objects at different scales.
1 code implementation • 18 Mar 2020 • Yuan-Qiang Cai, Longyin Wen, Libo Zhang, Dawei Du, Weiqiang Wang
In this paper, we propose a new task, i.e., simultaneous object localization and counting, abbreviated as Locount, which requires algorithms to localize groups of objects of interest along with the number of instances.
1 code implementation • 16 Mar 2020 • Pengfei Zhu, Jiayu Zheng, Dawei Du, Longyin Wen, Yiming Sun, QinGhua Hu
Moreover, an agent sharing network (ASNet) is proposed by self-supervised template sharing and view-aware fusion of the target from multiple drones, which can improve the tracking accuracy significantly compared with single drone tracking.
2 code implementations • 16 Jan 2020 • Pengfei Zhu, Longyin Wen, Dawei Du, Xiao Bian, Heng Fan, QinGhua Hu, Haibin Ling
We provide a large-scale drone-captured dataset, VisDrone, which includes four tracks, i.e., (1) image object detection, (2) video object detection, (3) single object tracking, and (4) multi-object tracking.
no code implementations • ECCV 2020 • Ruyi Ji, Dawei Du, Libo Zhang, Longyin Wen, Yanjun Wu, Chen Zhao, Feiyue Huang, Siwei Lyu
In this paper, we design a novel semantic neural tree for human parsing, which uses a tree architecture to encode the physiological structure of the human body, and designs a coarse-to-fine process in a cascade manner to generate accurate results.
no code implementations • 11 Dec 2019 • Wenzhang Zhou, Longyin Wen, Libo Zhang, Dawei Du, Tiejian Luo, Yanjun Wu
To reduce the impact of manually designed anchor boxes to adapt to different target motion patterns, we design the localization branch, which aims to coarsely localize the target to help the regression branch to generate accurate results.
1 code implementation • 4 Dec 2019 • Longyin Wen, Dawei Du, Pengfei Zhu, QinGhua Hu, Qilong Wang, Liefeng Bo, Siwei Lyu
This paper proposes a space-time multi-scale attention network (STANet) to solve density map estimation, localization and tracking in dense crowds of video clips captured by drones with arbitrary crowd density, perspective, and flight altitude.
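Density map estimation of this kind is commonly set up by placing a normalized Gaussian at each annotated head point, so the map integrates to the crowd count. The sketch below shows that standard construction under toy dimensions; it is not STANet itself.

```python
import numpy as np

# Standard density-map construction used in crowd counting: each annotated
# point contributes a Gaussian normalized to sum to 1, so the whole map
# sums to the ground-truth count.
def density_map(points, h, w, sigma=2.0):
    ys, xs = np.mgrid[0:h, 0:w]
    dmap = np.zeros((h, w))
    for py, px in points:
        g = np.exp(-((ys - py) ** 2 + (xs - px) ** 2) / (2 * sigma ** 2))
        dmap += g / g.sum()          # each person contributes exactly 1
    return dmap

points = [(10, 12), (25, 30), (40, 8)]   # three annotated people
dmap = density_map(points, 50, 50)
print(round(dmap.sum(), 4))  # 3.0, the ground-truth count
```

A counting network is then trained to regress this map, and the predicted count is simply the integral of the predicted map.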
1 code implementation • International Conference on Computer Vision Workshops 2019 • Dawei Du, Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Lin, QinGhua Hu, Tao Peng, Jiayu Zheng, Xinyao Wang, Yue Zhang, Liefeng Bo, Hailin Shi, Rui Zhu, Aashish Kumar, Aijin Li, Almaz Zinollayev, Anuar Askergaliyev, Arne Schumann, Binjie Mao, Byeongwon Lee, Chang Liu, Changrui Chen, Chunhong Pan, Chunlei Huo, Da Yu, Dechun Cong, Dening Zeng, Dheeraj Reddy Pailla, Di Li, Dong Wang, Donghyeon Cho, Dongyu Zhang, Furui Bai, George Jose, Guangyu Gao, Guizhong Liu, Haitao Xiong, Hao Qi, Haoran Wang, Heqian Qiu, Hongliang Li, Huchuan Lu, Ildoo Kim, Jaekyum Kim, Jane Shen, Jihoon Lee, Jing Ge, Jingjing Xu, Jingkai Zhou, Jonas Meier, Jun Won Choi, Junhao Hu, Junyi Zhang, Junying Huang, Kaiqi Huang, Keyang Wang, Lars Sommer, Lei Jin, Lei Zhang
Results of 33 object detection algorithms are presented.
2 code implementations • CVPR 2020 • Ruyi Ji, Longyin Wen, Libo Zhang, Dawei Du, Yanjun Wu, Chen Zhao, Xianglong Liu, Feiyue Huang
Specifically, we incorporate convolutional operations along edges of the tree structure, and use the routing functions in each node to determine the root-to-leaf computational paths within the tree.
Ranked #37 on Fine-Grained Image Classification on Stanford Cars
Fine-Grained Visual Categorization
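The routing functions described above can be sketched as soft binary routing: each internal node emits a branch probability, and each root-to-leaf path is weighted by the product of branch probabilities along it. The depth-2 tree and logits below are illustrative assumptions, not the paper's learned tree.

```python
import numpy as np

# Minimal sketch of soft routing in a binary neural tree: sigmoid routing
# probabilities at internal nodes, leaf weights as products along paths.
def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Depth-2 tree: internal nodes n0 (root), n1, n2; leaves l0..l3.
route_logits = {"n0": 0.3, "n1": -1.2, "n2": 0.7}  # per-sample router outputs
p0, p1, p2 = (sigmoid(route_logits[k]) for k in ("n0", "n1", "n2"))

leaf_probs = np.array([
    p0 * p1,              # l0: left, left
    p0 * (1 - p1),        # l1: left, right
    (1 - p0) * p2,        # l2: right, left
    (1 - p0) * (1 - p2),  # l3: right, right
])
print(round(leaf_probs.sum(), 6))  # 1.0: a proper distribution over paths
```

Because the path weights form a distribution, leaf predictions can be mixed differentiably and the whole tree trained end to end.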
no code implementations • 25 Sep 2019 • Yuan-Qiang Cai, Dawei Du, Libo Zhang, Longyin Wen, Weiqiang Wang, Yanjun Wu, Siwei Lyu
Object detection and counting are related but challenging problems, especially for drone based scenes with small objects and cluttered background.
no code implementations • 29 Jul 2019 • Jun Wan, Chi Lin, Longyin Wen, Yunan Li, Qiguang Miao, Sergio Escalera, Gholamreza Anbarjafari, Isabelle Guyon, Guodong Guo, Stan Z. Li
The ChaLearn large-scale gesture recognition challenge has been run twice, in two workshops in conjunction with the International Conference on Pattern Recognition (ICPR) 2016 and the International Conference on Computer Vision (ICCV) 2017, attracting more than 200 teams from around the world.
no code implementations • 10 Apr 2019 • Cong-Cong Li, Dawei Du, Libo Zhang, Tiejian Luo, Yanjun Wu, Qi Tian, Longyin Wen, Siwei Lyu
In this paper, we propose a new data priming method to solve the domain adaptation problem.
1 code implementation • CVPR 2019 • Kai Xu, Longyin Wen, Guorong Li, Liefeng Bo, Qingming Huang
Specifically, the temporal coherence branch pretrained in an adversarial fashion from unlabeled video data, is designed to capture the dynamic appearance and motion cues of video sequences to guide object segmentation.
Ranked #2 on Semi-Supervised Video Object Segmentation on YouTube
no code implementations • 10 Dec 2018 • Longyin Wen, Dawei Du, Shengkun Li, Xiao Bian, Siwei Lyu
The majority of Multi-Object Tracking (MOT) algorithms based on the tracking-by-detection scheme do not use higher order dependencies among objects or tracklets, which makes them less effective in handling complex scenarios.
no code implementations • 6 Nov 2018 • Wenbo Li, Longyin Wen, Xiao Bian, Siwei Lyu
Video style transfer is a useful component for applications such as augmented reality, non-photorealistic rendering, and interactive games.
1 code implementation • CVPR 2019 • Rui Zhu, Shifeng Zhang, Xiaobo Wang, Longyin Wen, Hailin Shi, Liefeng Bo, Tao Mei
Taking this advantage, we are able to explore various types of networks for object detection, without suffering from the poor convergence.
no code implementations • ECCV 2018 • Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, Stan Z. Li
Pedestrian detection in crowded scenes is a challenging problem since the pedestrians often gather together and occlude each other.
Ranked #10 on Pedestrian Detection on Caltech (using extra training data)
no code implementations • 20 Apr 2018 • Pengfei Zhu, Longyin Wen, Xiao Bian, Haibin Ling, QinGhua Hu
In this paper we present a large-scale visual object detection and tracking benchmark, named VisDrone2018, aiming at advancing visual understanding tasks on the drone platform.
11 code implementations • CVPR 2018 • Shifeng Zhang, Longyin Wen, Xiao Bian, Zhen Lei, Stan Z. Li
For object detection, the two-stage approach (e.g., Faster R-CNN) has been achieving the highest accuracy, whereas the one-stage approach (e.g., SSD) has the advantage of high efficiency.
Ranked #177 on Object Detection on COCO test-dev
no code implementations • ICCV 2017 • Wenbo Li, Longyin Wen, Ming-Ching Chang, Ser Nam Lim, Siwei Lyu
The RNNs in RNN-T are co-trained with the action category hierarchy, which determines the structure of RNN-T.
Ranked #118 on Skeleton Based Action Recognition on NTU RGB+D
no code implementations • 13 Jun 2017 • Longyin Wen, Honggang Qi, Siwei Lyu
Our method recovers the original pixel histogram and the contrast enhancement simultaneously from a single image with an iterative algorithm.
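A classic observable behind this line of work, sketched here under toy data: contrast enhancement on 8-bit images maps integers to integers, leaving characteristic empty bins ("gaps") in the pixel histogram. The gamma mapping below is an illustrative enhancement, not the paper's recovery algorithm.

```python
import numpy as np

# Sketch of the forensic cue: gamma-correcting 8-bit pixel values is a
# many-to-one / gappy integer mapping, so the enhanced histogram acquires
# empty bins that a natural histogram lacks.
rng = np.random.default_rng(1)
img = rng.integers(0, 256, size=(64, 64))            # toy "natural" image

gamma = 1.5
enhanced = np.round(255 * (img / 255.0) ** gamma).astype(int)

hist_orig = np.bincount(img.ravel(), minlength=256)
hist_enh = np.bincount(enhanced.ravel(), minlength=256)

# The enhanced histogram has far more empty bins than the original.
print(int((hist_orig == 0).sum()), int((hist_enh == 0).sum()))
```

Recovering the original histogram then amounts to inverting such a mapping, which the paper does jointly with estimating the enhancement itself.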
no code implementations • NeurIPS 2016 • Yiming Ying, Longyin Wen, Siwei Lyu
From this saddle-point representation, a stochastic online algorithm (SOLAM) is proposed, whose per-iteration time and space complexity is that of a single datum.
no code implementations • 18 Mar 2016 • Dawei Du, Honggang Qi, Longyin Wen, Qi Tian, Qingming Huang, Siwei Lyu
Graph-based representation is widely used in the visual tracking field to find correct correspondences between target parts in consecutive frames.
no code implementations • ICCV 2015 • Wenbo Li, Longyin Wen, Mooi Choo Chuah, Siwei Lyu
In this paper, we propose the category-blind human recognition method (CHARM), which can recognize a human action without making assumptions about the action category.
no code implementations • 13 Nov 2015 • Longyin Wen, Dawei Du, Zhaowei Cai, Zhen Lei, Ming-Ching Chang, Honggang Qi, Jongwoo Lim, Ming-Hsuan Yang, Siwei Lyu
In this work, we perform a comprehensive quantitative study on the effects of object detection accuracy to the overall MOT performance, using the new large-scale University at Albany DETection and tRACking (UA-DETRAC) benchmark dataset.
no code implementations • CVPR 2015 • Longyin Wen, Dawei Du, Zhen Lei, Stan Z. Li, Ming-Hsuan Yang
We present a novel Joint Online Tracking and Segmentation (JOTS) algorithm which integrates the multi-part tracking and segmentation into a unified energy optimization framework to handle the video segmentation task.
no code implementations • CVPR 2014 • Longyin Wen, Wenbo Li, Junjie Yan, Zhen Lei, Dong Yi, Stan Z. Li
Multi-target tracking is an interesting but challenging task in the computer vision field.
no code implementations • CVPR 2014 • Junjie Yan, Zhen Lei, Longyin Wen, Stan Z. Li
Three prohibitive steps in cascade version of DPM are accelerated, including 2D correlation between root filter and feature map, cascade part pruning and HOG feature extraction.
no code implementations • CVPR 2014 • Menglong Yang, Yiguang Liu, Longyin Wen, Zhisheng You, Stan Z. Li
Mutual occlusions among targets can cause track loss or target position deviation, because the observation likelihood of an occluded target may vanish even when we have the estimated location of the target.