Search Results for author: Chen Sun

Found 76 papers, 38 papers with code

Evaluating the Generation Capabilities of Large Chinese Language Models

2 code implementations 9 Aug 2023 Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang

This paper presents CG-Eval, the first comprehensive evaluation of the generation capabilities of large Chinese language models across a wide range of academic disciplines.

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

no code implementations 31 Jul 2023 Qi Zhao, Ce Zhang, Shijie Wang, Changcheng Fu, Nakul Agarwal, Kwonjoon Lee, Chen Sun

We propose to formulate the long-term action anticipation (LTA) task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal.

Action Anticipation Long Term Action Anticipation

Does Visual Pretraining Help End-to-End Reasoning?

no code implementations 17 Jul 2023 Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.

Image Classification object-detection +2

NLOS Dies Twice: Challenges and Solutions of V2X for Cooperative Perception

no code implementations 13 Jul 2023 Lantao Li, Chen Sun

Multi-agent multi-lidar sensor fusion between connected vehicles for cooperative perception has recently been recognized as the best technique for minimizing the blind zone of individual vehicular perception systems and further enhancing the overall safety of autonomous driving systems.

Autonomous Driving Sensor Fusion

Federated Learning over a Wireless Network: Distributed User Selection through Random Access

no code implementations 7 Jul 2023 Chen Sun, Shiyao Ma, Ce Zheng, Songtao Wu, Tao Cui, Lingjuan Lyu

This study proposes a network intrinsic approach of distributed user selection that leverages the radio resource competition mechanism in random access.

Fairness Federated Learning

Goal-Conditioned Predictive Coding as an Implicit Planner for Offline Reinforcement Learning

no code implementations 7 Jul 2023 Zilai Zeng, Ce Zhang, Shijie Wang, Chen Sun

Recent work has demonstrated the effectiveness of formulating decision making as a supervised learning problem on offline-collected trajectories.

Decision Making Offline RL

How can objects help action recognition?

1 code implementation CVPR 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.

Action Recognition

Dense Video Object Captioning from Disjoint Supervision

1 code implementation 20 Jun 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

We show our task is more general than grounding, and models trained on our task can directly be applied to grounding by finding the bounding box with the maximum likelihood of generating the query sentence.

Video Grounding

2nd Place Winning Solution for the CVPR2023 Visual Anomaly and Novelty Detection Challenge: Multimodal Prompting for Data-centric Anomaly Detection

no code implementations 15 Jun 2023 Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Liang Gao, Weiming Shen

This technical report introduces the winning solution of the team Segment Any Anomaly for the CVPR2023 Visual Anomaly and Novelty Detection (VAND) challenge.

Anomaly Detection

AVIS: Autonomous Visual Information Seeking with Large Language Models

no code implementations 13 Jun 2023 Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A Ross, Cordelia Schmid, Alireza Fathi

Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions.

Decision Making Language Modelling +3

Segment Any Anomaly without Training via Hybrid Prompt Regularization

2 code implementations 18 May 2023 Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Zongwei Du, Liang Gao, Weiming Shen

We present a novel framework, i.e., Segment Any Anomaly + (SAA+), for zero-shot anomaly segmentation with hybrid prompt regularization to improve the adaptability of modern foundation models.

Anomaly Detection

End-to-End Spatio-Temporal Action Localisation with Video Transformers

no code implementations 24 Apr 2023 Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab

The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks.

Ranked #1 on Action Recognition on AVA v2.1 (using extra training data)

Action Detection Action Recognition +1

Comparing Trajectory and Vision Modalities for Verb Representation

no code implementations 8 Mar 2023 Dylan Ebert, Chen Sun, Ellie Pavlick

Given the importance of 3D space in formal models of verb semantics, we expect that these 2D images would result in impoverished representations that fail to capture nuanced differences in meaning.

Representation Learning

Steerable Equivariant Representation Learning

no code implementations 22 Feb 2023 Sangnie Bhardwaj, Willie McClinton, Tongzhou Wang, Guillaume Lajoie, Chen Sun, Phillip Isola, Dilip Krishnan

In this paper, we propose a method of learning representations that are instead equivariant to data augmentations.

Image Retrieval object-detection +5

DEJA VU: Continual Model Generalization For Unseen Domains

2 code implementations 25 Jan 2023 Chenxi Liu, Lixu Wang, Lingjuan Lyu, Chen Sun, Xiao Wang, Qi Zhu

To overcome these limitations of DA and DG in handling the Unfamiliar Period during continual domain shift, we propose RaTP, a framework that focuses on improving models' target domain generalization (TDG) capability, while also achieving effective target domain adaptation (TDA) capability right after training on certain domains and forgetting alleviation (FA) capability on past domains.

Data Augmentation Domain Generalization

ConSpec: honing in on critical steps for rapid learning and generalization in RL

1 code implementation 12 Oct 2022 Chen Sun, Wannan Yang, Thomas Jiralerspong, Dane Malenfant, Benjamin Alsbury-Nealy, Yoshua Bengio, Blake Richards

These critical steps are challenging to identify with traditional reinforcement learning (RL) methods that rely on the Bellman equation for credit assignment.

Continuous Control Contrastive Learning +3

A New Knowledge Distillation Network for Incremental Few-Shot Surface Defect Detection

1 code implementation 1 Sep 2022 Chen Sun, Liang Gao, Xinyu Li, Yiping Gao

The proposed DKAN method follows a pretraining-finetuning transfer learning paradigm and a knowledge distillation framework is designed for fine-tuning.

Defect Detection Knowledge Distillation +1

Do Trajectories Encode Verb Meaning?

no code implementations NAACL 2022 Dylan Ebert, Chen Sun, Ellie Pavlick

Distributional models learn representations of words from text, but are criticized for their lack of grounding, or the linking of text to the non-linguistic world.

Representation Learning

AVATAR: Unconstrained Audiovisual Speech Recognition

1 code implementation 15 Jun 2022 Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Learning Audio-Video Modalities from Image Captions

no code implementations 1 Apr 2022 Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Ranked #19 on Zero-Shot Video Retrieval on MSR-VTT (using extra training data)

Image Captioning Retrieval +3

Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?

no code implementations 31 Mar 2022 Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun

CompMap first asks a VL model to generate primitive concept activations with text prompts, and then learns to construct a composition model that maps the primitive concept activations (e.g. the likelihood of black tail or red wing) to composite concepts (e.g. a red-winged blackbird).

Fine-Grained Visual Recognition Zero-Shot Learning

Trajectory balance: Improved credit assignment in GFlowNets

2 code implementations 31 Jan 2022 Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, Yoshua Bengio

Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object.

Multiview Transformers for Video Recognition

1 code implementation CVPR 2022 Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.

Ranked #4 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)

Action Classification Action Recognition +1

Masking Modalities for Cross-modal Video Retrieval

no code implementations 1 Nov 2021 Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.

Retrieval Video Retrieval

Does Vision-and-Language Pretraining Improve Lexical Grounding?

1 code implementation Findings (EMNLP) 2021 Tian Yun, Chen Sun, Ellie Pavlick

Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world.

Question Answering Visual Question Answering

DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets

2 code implementations ICCV 2021 Junru Gu, Chen Sun, Hang Zhao

In this work, we propose an anchor-free and end-to-end trajectory prediction model, named DenseTNT, that directly outputs a set of trajectories from dense goal candidates.

Motion Forecasting motion prediction +1

Discrete-Valued Neural Communication

no code implementations NeurIPS 2021 Dianbo Liu, Alex Lamb, Kenji Kawaguchi, Anirudh Goyal, Chen Sun, Michael Curtis Mozer, Yoshua Bengio

Deep learning has advanced from fully connected architectures to structured models organized into components, e.g., the transformer composed of positional elements, modular architectures divided into slots, and graph neural nets made up of nodes.

Quantization Systematic Generalization

Attention Bottlenecks for Multimodal Fusion

1 code implementation NeurIPS 2021 Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Action Classification Action Recognition +2

Episodic Transformer for Vision-and-Language Navigation

1 code implementation ICCV 2021 Alexander Pashevich, Cordelia Schmid, Chen Sun

We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.

Vision and Language Navigation

Unified Graph Structured Models for Video Understanding

no code implementations ICCV 2021 Anurag Arnab, Chen Sun, Cordelia Schmid

Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.

Action Detection Graph Classification +3

ViViT: A Video Vision Transformer

6 code implementations ICCV 2021 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.

Ranked #8 on Action Classification on Moments in Time (Top 5 Accuracy metric, using extra training data)

Action Classification Action Recognition +4

Learning Temporal Dynamics from Cycles in Narrated Video

no code implementations ICCV 2021 Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.

Multi-modal Transformer for Video Retrieval

1 code implementation ECCV 2020 Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.

Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT (text-to-video Mean Rank metric, using extra training data)

Natural Language Queries Retrieval +2

What Makes for Good Views for Contrastive Learning?

1 code implementation NeurIPS 2020 Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.

Contrastive Learning Data Augmentation +8

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

3 code implementations CVPR 2020 Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights).

Self-Driving Cars

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations CVPR 2020 Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.

Action Recognition

Automated Pyramid Summarization Evaluation

1 code implementation CONLL 2019 Yanjun Gao, Chen Sun, Rebecca J. Passonneau

Pyramid evaluation was developed to assess the content of paragraph length summaries of source texts.

Learning Video Representations using Contrastive Bidirectional Transformer

no code implementations 13 Jun 2019 Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid

This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Modeling Parts, Structure, and System Dynamics via Predictive Learning

no code implementations ICLR 2019 Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu

Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future.

Predicting the Present and Future States of Multi-agent Systems from Partially-observed Visual Data

no code implementations ICLR 2019 Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, Kevin Murphy

We present a method which learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents.

Affordance Learning In Direct Perception for Autonomous Driving

no code implementations 20 Mar 2019 Chen Sun, Jean M. Uwabeza Vianney, Dongpu Cao

Our results indicate that this method could act as a cheaper way for training data collection in autonomous driving.

Autonomous Driving road scene understanding

Unsupervised Discovery of Parts, Structure, and Dynamics

no code implementations 12 Mar 2019 Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu

Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future.

Stochastic Prediction of Multi-Agent Interactions from Partial Observations

no code implementations 25 Feb 2019 Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, Kevin Murphy

We present a method that learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents.

D3D: Distilled 3D Networks for Video Action Recognition

1 code implementation 19 Dec 2018 Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar

State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input.

Action Classification Action Recognition +2

Composing Text and Image for Image Retrieval - An Empirical Odyssey

4 code implementations CVPR 2019 Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays

In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image.

Image Retrieval Image Retrieval with Multi-Modal Query +1

Actor-Centric Relation Network

1 code implementation ECCV 2018 Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid

A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Action Classification Action Detection +4

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning

1 code implementation CVPR 2018 Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie

We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure.

Fine-Grained Image Classification Fine-Grained Visual Categorization +1

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

1 code implementation ECCV 2018 Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy

Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification.

Ranked #24 on Action Recognition on UCF101 (using extra training data)

Action Classification Action Detection +6

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

1 code implementation ICCV 2017 Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, Boqing Gong

Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understandings about the same visual scenes --- and even the same set of images (e.g., of COCO).

Language Modelling Multiple-choice +3

The iNaturalist Species Classification and Detection Dataset

13 code implementations CVPR 2018 Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie

Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories.

General Classification Image Classification

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

6 code implementations CVPR 2018 Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

Actin Detection Action Detection +3

TALL: Temporal Activity Localization via Language Query

9 code implementations ICCV 2017 Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia

For evaluation, we adopt TaCoS dataset, and build a new dataset for this task on top of Charades by adding sentence temporal annotations, called Charades-STA.

Natural Language Queries regression +1

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals

1 code implementation ICCV 2017 Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, Ram Nevatia

Temporal Action Proposal (TAP) generation is an important problem, as fast and accurate extraction of semantically important (e.g. human actions) segments from untrimmed videos is an important step for large-scale video analysis.

regression Temporal Action Localization

Complex Event Recognition from Images with Few Training Examples

no code implementations 17 Jan 2017 Unaiza Ahsan, Chen Sun, James Hays, Irfan Essa

We propose to leverage concept-level representations for complex event recognition in photographs given limited training examples.

ACD: Action Concept Discovery from Image-Sentence Corpora

no code implementations 16 Apr 2016 Jiyang Gao, Chen Sun, Ram Nevatia

It obtains candidate action concepts by extracting verb-object pairs from sentences and verifies their visualness with the associated images.

Action Classification Classification +1

Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

1 code implementation 4 Apr 2015 Chen Sun, Sanketh Shetty, Rahul Sukthankar, Ram Nevatia

To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output.

Action Recognition Temporal Action Localization +1

DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting

no code implementations CVPR 2014 Chen Sun, Ram Nevatia

Our goal is to find the important segments and capture their information for event classification and recounting.

General Classification
