Search Results for author: Xingyi Zhou

Found 24 papers, 18 papers with code

STT: Stateful Tracking with Transformers for Autonomous Driving

no code implementations 30 Apr 2024 Longlong Jing, Ruichi Yu, Xu Chen, Zhengli Zhao, Shiwei Sheng, Colin Graber, Qi Chen, Qinru Li, Shangxuan Wu, Han Deng, Sangjin Lee, Chris Sweeney, Qiurui He, Wei-Chih Hung, Tong He, Xingyi Zhou, Farshid Moussavi, Zijian Guo, Yin Zhou, Mingxing Tan, Weilong Yang, CongCong Li

In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in the scenes while also predicting their states accurately.

Autonomous Driving

Streaming Dense Video Captioning

1 code implementation 1 Apr 2024 Xingyi Zhou, Anurag Arnab, Shyamal Buch, Shen Yan, Austin Myers, Xuehan Xiong, Arsha Nagrani, Cordelia Schmid

An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos, predict rich, detailed textual descriptions, and be able to produce outputs before processing the entire video.

Dense Video Captioning

Pixel Aligned Language Models

no code implementations 14 Dec 2023 Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modelling

MaskConver: Revisiting Pure Convolution Model for Panoptic Segmentation

1 code implementation 11 Dec 2023 Abdullah Rashwan, Jiageng Zhang, Ali Taalimi, Fan Yang, Xingyi Zhou, Chaochao Yan, Liang-Chieh Chen, Yeqing Li

With ResNet50 backbone, our MaskConver achieves 53.6% PQ on the COCO panoptic val set, outperforming the modern convolution-based model, Panoptic FCN, by 9.3% as well as transformer-based models such as Mask2Former (+1.7% PQ) and kMaX-DeepLab (+0.6% PQ).

Decoder Panoptic Segmentation

Does Visual Pretraining Help End-to-End Reasoning?

no code implementations NeurIPS 2023 Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.

Image Classification Object +3

Dense Video Object Captioning from Disjoint Supervision

1 code implementation 20 Jun 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.

Object Sentence +2

How can objects help action recognition?

1 code implementation CVPR 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.

Action Recognition Object

NMS Strikes Back

1 code implementation 12 Dec 2022 Jeffrey Ouyang-Zhang, Jang Hyun Cho, Xingyi Zhou, Philipp Krähenbühl

Our detector that trains Deformable-DETR with traditional IoU-based label assignment achieved 50.2 COCO mAP within 12 epochs (1x schedule) with ResNet50 backbone, outperforming all existing traditional or transformer-based detectors in this setting.

Attribute object-detection +1

Global Tracking Transformers

1 code implementation CVPR 2022 Xingyi Zhou, Tianwei Yin, Vladlen Koltun, Philipp Krähenbühl

The transformer encodes object features from all frames, and uses trajectory queries to group them into trajectories.

Ranked #13 on Multi-Object Tracking on SportsMOT (using extra training data)

Multi-Object Tracking Object

Detecting Twenty-thousand Classes using Image-level Supervision

1 code implementation 7 Jan 2022 Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, Ishan Misra

For the first time, we train a detector with all the twenty-one-thousand classes of the ImageNet dataset and show that it generalizes to new datasets without finetuning.

Image Classification Open Vocabulary Object Detection

Multimodal Virtual Point 3D Detection

1 code implementation NeurIPS 2021 Tianwei Yin, Xingyi Zhou, Philipp Krähenbühl

For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one measurement or two.

3D Object Detection Autonomous Driving

Learning a unified label space

no code implementations 1 Jan 2021 Xingyi Zhou, Vladlen Koltun, Philipp Kraehenbuehl

These labels span many diverse datasets with potentially inconsistent semantic labels.

Instance Segmentation Object +3

Tracking Objects as Points

7 code implementations ECCV 2020 Xingyi Zhou, Vladlen Koltun, Philipp Krähenbühl

Nowadays, tracking is dominated by pipelines that perform object detection followed by temporal association, also known as tracking-by-detection.

Multi-Object Tracking Multiple Object Tracking +2

StarMap for Category-Agnostic Keypoint and Viewpoint Estimation

1 code implementation ECCV 2018 Xingyi Zhou, Arjun Karpur, Linjie Luo, Qi-Xing Huang

Existing methods define semantic keypoints separately for each category with a fixed number of semantic labels in fixed indices.

Keypoint Detection Viewpoint Estimation

Unsupervised Domain Adaptation for 3D Keypoint Estimation via View Consistency

1 code implementation ECCV 2018 Xingyi Zhou, Arjun Karpur, Chuang Gan, Linjie Luo, Qi-Xing Huang

In this paper, we introduce a novel unsupervised domain adaptation technique for the task of 3D keypoint prediction from a single depth scan or image.

Keypoint Estimation Unsupervised Domain Adaptation

Deep Kinematic Pose Regression

no code implementations 17 Sep 2016 Xingyi Zhou, Xiao Sun, Wei Zhang, Shuang Liang, Yichen Wei

In this work, we propose to directly embed a kinematic object model into the deep neural network learning for general articulated object pose estimation.

3D Human Pose Estimation Object +2

Model-based Deep Hand Pose Estimation

1 code implementation 22 Jun 2016 Xingyi Zhou, Qingfu Wan, Wei Zhang, Xiangyang Xue, Yichen Wei

For the first time, we show that embedding such a non-linear generative process in deep learning is feasible for hand pose estimation.

Hand Pose Estimation valid
