Search Results for author: Chen Sun

Found 106 papers, 51 papers with code

RG-Attn: Radian Glue Attention for Multi-modality Multi-agent Cooperative Perception

no code implementations 28 Jan 2025 Lantao Li, Kang Yang, Wenqi Zhang, Xiaoxue Wang, Chen Sun

To harness the potential of every possible data source for optimal performance, we design a robust LiDAR and camera cross-modality fusion module, Radian-Glue-Attention (RG-Attn), applicable to both intra-agent and inter-agent cross-modality fusion scenarios, owing to convenient coordinate conversion via transformation matrices and a unified sampling/inversion mechanism.

Hybrid Action Based Reinforcement Learning for Multi-Objective Compatible Autonomous Driving

no code implementations 14 Jan 2025 Guizhe Jin, Zhuoren Li, Bo Leng, Wei Han, Lu Xiong, Chen Sun

To this end, we propose a Multi-objective Ensemble-Critic reinforcement learning method with Hybrid Parametrized Action for multi-objective compatible autonomous driving.

Attribute Autonomous Driving +1

MotiF: Making Text Count in Image Animation with Motion Focal Loss

no code implementations 20 Dec 2024 Shijie Wang, Samaneh Azadi, Rohit Girdhar, Saketh Rambhatla, Chen Sun, Xi Yin

To overcome this limitation, we introduce MotiF, a simple yet effective approach that directs the model's learning to the regions with more motion, thereby improving the text alignment and motion generation.

Image Animation Motion Generation +1
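
The motion focal loss described in the snippet amounts to reweighting a reconstruction loss by a motion heatmap so that high-motion regions dominate the objective. A minimal 1-D sketch of that idea (function name, flat shapes, and normalization are our simplifying assumptions, not the paper's exact loss):

```python
def motion_focal_mse(pred, target, motion_weight):
    # Per-pixel squared error reweighted by a motion heatmap, so regions
    # with more motion contribute more to the objective. Inputs are flat
    # lists of pixel values; real implementations operate on video tensors.
    assert len(pred) == len(target) == len(motion_weight)
    total_w = sum(motion_weight)
    return sum(w * (p - t) ** 2
               for p, t, w in zip(pred, target, motion_weight)) / total_w
```

With weights concentrated on the moving region, errors there are penalized more than under a uniform weighting.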

Motion Prompting: Controlling Video Generation with Motion Trajectories

no code implementations 3 Dec 2024 Daniel Geng, Charles Herrmann, Junhwa Hur, Forrester Cole, Serena Zhang, Tobias Pfaff, Tatiana Lopez-Guevara, Carl Doersch, Yusuf Aytar, Michael Rubinstein, Chen Sun, Oliver Wang, Andrew Owens, Deqing Sun

Motion control is crucial for generating expressive and compelling video content; however, most existing video generation models rely mainly on text prompts for control, which struggle to capture the nuances of dynamic actions and temporal compositions.

Video Generation

$100K or 100 Days: Trade-offs when Pre-Training with Academic Resources

1 code implementation 30 Oct 2024 Apoorv Khandelwal, Tian Yun, Nihal V. Nayak, Jack Merullo, Stephen H. Bach, Chen Sun, Ellie Pavlick

We introduce a benchmark to measure the time to pre-train models on given GPUs and also identify ideal settings for maximizing training speed.

Learning and Unlearning of Fabricated Knowledge in Language Models

no code implementations 29 Oct 2024 Chen Sun, Nolan Andrew Miller, Andrey Zhmoginov, Max Vladymyrov, Mark Sandler

What happens when a new piece of knowledge is introduced into the training data and how long does it last while a large language model (LM) continues to train?

Data Poisoning Language Modeling +3

Fourier Head: Helping Large Language Models Learn Complex Probability Distributions

no code implementations 29 Oct 2024 Nate Gillman, Daksh Aggarwal, Michael Freeman, Saurabh Singh, Chen Sun

As the quality of large language models has improved, there has been increased interest in using them to model non-linguistic tokens.

Decision Making Time Series Forecasting
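
The Fourier head replaces a standard softmax head with a density parameterized by a truncated Fourier series, evaluated at bin centers and normalized into a categorical distribution. A simplified sketch of that idea (real implementations learn complex coefficients end-to-end; the real-coefficient parameterization here is our assumption):

```python
import math

def fourier_head_pmf(coeffs_a, coeffs_b, n_bins):
    # Evaluate a truncated Fourier series on [-1, 1] at bin centers and
    # softmax-normalize into a categorical distribution. With few low-
    # frequency terms, the resulting pmf varies smoothly across bins.
    centers = [-1.0 + (2 * i + 1) / n_bins for i in range(n_bins)]
    logits = []
    for x in centers:
        v = sum(a * math.cos(math.pi * (k + 1) * x) +
                b * math.sin(math.pi * (k + 1) * x)
                for k, (a, b) in enumerate(zip(coeffs_a, coeffs_b)))
        logits.append(v)
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]
```

With all coefficients zero the head reduces to a uniform distribution over bins.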

Do Music Generation Models Encode Music Theory?

1 code implementation 1 Oct 2024 Megan Wei, Michael Freeman, Chris Donahue, Chen Sun

Our findings suggest that music theory concepts are discernible within foundation models and that the degree to which they are detectable varies by model size and layer.

Emotion Recognition Genre classification +3

Do Pre-trained Vision-Language Models Encode Object States?

no code implementations 16 Sep 2024 Kaleb Newman, Shijie Wang, Yuan Zang, David Heffren, Chen Sun

For a vision-language model (VLM) to understand the physical world, such as cause and effect, a first step is to capture the temporal dynamics of the visual world, for example how the physical states of objects evolve over time (e.g. a whole apple into a sliced apple).

Language Modeling Language Modelling +3

DiReDi: Distillation and Reverse Distillation for AIoT Applications

no code implementations 12 Sep 2024 Chen Sun, Qing Tong, Wenshuang Yang, Wenqi Zhang

When the user needs to update the edge AI model to better fit the actual scenario, the reverse distillation (RD) process is employed to extract the knowledge (the difference between user preferences and the manufacturer's presumptions) from the edge AI model using the user's exclusive data.

Knowledge Distillation Management

EPO: Hierarchical LLM Agents with Environment Preference Optimization

1 code implementation 28 Aug 2024 Qi Zhao, Haotian Fu, Chen Sun, George Konidaris

Long-horizon decision-making tasks present significant challenges for LLM-based agents due to the need for extensive planning over multiple steps.

Action Generation Decision Making

Learning Visual Grounding from Generative Vision and Language Model

no code implementations 18 Jul 2024 Shijie Wang, Dahun Kim, Ali Taalimi, Chen Sun, Weicheng Kuo

In this work, we explore whether generative VLMs predominantly trained on image-text data could be leveraged to scale up the text annotation of visual grounding data.

Attribute Language Modeling +8

Potential Based Diffusion Motion Planning

no code implementations 8 Jul 2024 Yunhao Luo, Chen Sun, Joshua B. Tenenbaum, Yilun Du

An advantage of potential based motion planning is composability -- different motion constraints can be easily combined by adding corresponding potentials.

global-optimization Motion Planning
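
The composability claim in the snippet — constraints combine by adding potentials — can be illustrated with a toy planner. The quadratic goal term, disc-shaped obstacle penalty, and finite-difference descent below are our illustrative stand-ins, not the paper's learned diffusion potentials:

```python
# Combine two motion constraints by summing their potentials, then
# descend the combined potential to a low-cost configuration.

GOAL = (1.0, 1.0)
OBSTACLE, RADIUS = (0.5, 0.2), 0.2

def goal_potential(x):
    return (x[0] - GOAL[0]) ** 2 + (x[1] - GOAL[1]) ** 2

def obstacle_potential(x):
    d2 = (x[0] - OBSTACLE[0]) ** 2 + (x[1] - OBSTACLE[1]) ** 2
    return 100.0 * max(0.0, RADIUS ** 2 - d2)  # penalize entering the disc

def combined_potential(x):
    # Constraints compose by simple addition of their potentials.
    return goal_potential(x) + obstacle_potential(x)

def descend(potential, x, steps=500, lr=0.01, eps=1e-5):
    # Finite-difference gradient descent toward a low-potential point.
    x = list(x)
    for _ in range(steps):
        grad = []
        for i in range(len(x)):
            up, down = x[:], x[:]
            up[i] += eps
            down[i] -= eps
            grad.append((potential(up) - potential(down)) / (2 * eps))
        x = [xi - lr * g for xi, g in zip(x, grad)]
    return x

plan_end = descend(combined_potential, [0.0, 0.0])
```

Adding a third constraint would just mean adding a third term to `combined_potential`.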

Edge AI-Enabled Chicken Health Detection Based on Enhanced FCOS-Lite and Knowledge Distillation

no code implementations 3 Jul 2024 Qiang Tong, Jinrui Wang, Wenshuang Yang, Songtao Wu, Wenqi Zhang, Chen Sun, Kuanhong Xu

The utilization of AIoT technology has become a crucial trend in modern poultry management, offering the potential to optimize farming operations and reduce human workloads.

Knowledge Distillation Quantization

Text-Aware Diffusion for Policy Learning

no code implementations 2 Jul 2024 Calvin Luo, Mandy He, Zilai Zeng, Chen Sun

Training an agent to achieve particular goals or perform desired behaviors is often accomplished through reinforcement learning, especially in the absence of expert demonstrations.

reinforcement-learning Reinforcement Learning

On the Combination of AI and Wireless Technologies: 3GPP Standardization Progress

no code implementations 17 Jun 2024 Chen Sun, Tao Cui, Wenqi Zhang, Yingshuang Bai, Shuo Wang, Haojin Li

Combining Artificial Intelligence (AI) and wireless communication technologies has become one of the major technology trends towards 2030.

Management

Pre-trained Vision-Language Models Learn Discoverable Visual Concepts

1 code implementation 19 Apr 2024 Yuan Zang, Tian Yun, Hao Tan, Trung Bui, Chen Sun

Do vision-language models (VLMs) pre-trained to caption an image of a "durian" learn visual concepts such as "brown" (color) and "spiky" (texture) at the same time?

Precoder Design for User-Centric Network Massive MIMO with Matrix Manifold Optimization

no code implementations 11 Apr 2024 Rui Sun, Li You, An-An Lu, Chen Sun, Xiqi Gao, Xiang-Gen Xia

In this paper, we investigate the precoder design for user-centric network (UCN) massive multiple-input multiple-output (mMIMO) downlink with matrix manifold optimization.

Computational Efficiency

Self-Correcting Self-Consuming Loops for Generative Model Training

1 code implementation 11 Feb 2024 Nate Gillman, Michael Freeman, Daksh Aggarwal, Chia-Hong Hsu, Calvin Luo, Yonglong Tian, Chen Sun

As synthetic data becomes higher quality and proliferates on the internet, machine learning models are increasingly trained on a mix of human- and machine-generated data.

Motion Synthesis Representation Learning

Pixel-Aligned Language Model

no code implementations CVPR 2024 Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modeling Language Modelling +1

Pixel Aligned Language Models

no code implementations 14 Dec 2023 Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid

When taking locations as inputs, the model performs location-conditioned captioning, which generates captions for the indicated object or region.

Language Modeling Language Modelling

Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding

no code implementations 30 Nov 2023 Rohan Myer Krishnan, Zitian Tang, Zhiqiu Yu, Chen Sun

To do this, video-language models must be able to obtain structured understandings, such as the temporal segmentation of a demonstration into sequences of actions and skills, and to generalize the understandings to novel domains.

Video Retrieval Video Understanding

Vamos: Versatile Action Models for Video Understanding

1 code implementation 22 Nov 2023 Shijie Wang, Qi Zhao, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

To interpret the important text evidence for question answering, we generalize the concept bottleneck model to work with tokens and nonlinear models, using hard attention to select a small subset of tokens from the free-form text as inputs to the LLM reasoner.

EgoSchema Hard Attention +4
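
The hard-attention bottleneck described in the snippet selects a small subset of text tokens before reasoning. A minimal sketch of that selection step (function name, scores, and top-k rule are our illustrative assumptions, not Vamos's actual implementation):

```python
def hard_attention_select(tokens, scores, k):
    # Keep only the k highest-scoring tokens, preserving their original
    # order, as the hard-attention bottleneck fed to a downstream reasoner.
    top = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:k]
    return [tokens[i] for i in sorted(top)]
```

Because the reasoner sees only the surviving tokens, the selected subset doubles as an interpretable explanation of the answer.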

Towards A Unified Neural Architecture for Visual Recognition and Reasoning

no code implementations 10 Nov 2023 Calvin Luo, Boqing Gong, Ting Chen, Chen Sun

Motivated by the recent success of multi-task transformers for visual recognition and language understanding, we propose a unified neural architecture for visual recognition and reasoning with a generic interface (e.g., tokens) for both.

Object object-detection +2

Analyzing Modular Approaches for Visual Question Decomposition

1 code implementation 10 Nov 2023 Apoorv Khandelwal, Ellie Pavlick, Chen Sun

Modular neural networks without additional training have recently been shown to surpass end-to-end neural networks on challenging vision-language tasks.

Code Generation Visual Question Answering (VQA)

Towards Generic Anomaly Detection and Understanding: Large-scale Visual-linguistic Model (GPT-4V) Takes the Lead

1 code implementation 5 Nov 2023 Yunkang Cao, Xiaohao Xu, Chen Sun, Xiaonan Huang, Weiming Shen

This study explores the use of GPT-4V(ision), a powerful visual-linguistic model, to address anomaly detection tasks in a generic manner.

3D Anomaly Detection Time Series

Emergence of Abstract State Representations in Embodied Sequence Modeling

no code implementations 3 Nov 2023 Tian Yun, Zilai Zeng, Kunal Handa, Ashish V. Thapliyal, Bo Pang, Ellie Pavlick, Chen Sun

Decision making via sequence modeling aims to mimic the success of language models, where actions taken by an embodied agent are modeled as tokens to predict.

Decision Making

Object-centric Video Representation for Long-term Action Anticipation

1 code implementation 31 Oct 2023 Ce Zhang, Changcheng Fu, Shijie Wang, Nakul Agarwal, Kwonjoon Lee, Chiho Choi, Chen Sun

To recognize and predict human-object interactions, we use a Transformer-based neural architecture which allows the "retrieval" of relevant objects for action anticipation at various time scales.

Action Anticipation Human-Object Interaction Detection +4

Delta-AI: Local objectives for amortized inference in sparse graphical models

1 code implementation 3 Oct 2023 Jean-Pierre Falet, Hae Beom Lee, Nikolay Malkin, Chen Sun, Dragos Secrieru, Thomas Jiralerspong, Dinghuai Zhang, Guillaume Lajoie, Yoshua Bengio

We present a new algorithm for amortized inference in sparse probabilistic graphical models (PGMs), which we call $\Delta$-amortized inference ($\Delta$-AI).

Evaluating the Generation Capabilities of Large Chinese Language Models

2 code implementations 9 Aug 2023 Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang

This paper unveils CG-Eval, the first-ever comprehensive and automated evaluation framework designed for assessing the generative capabilities of large Chinese language models across a spectrum of academic disciplines.

Text Generation

AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?

1 code implementation 31 Jul 2023 Qi Zhao, Shijie Wang, Ce Zhang, Changcheng Fu, Minh Quan Do, Nakul Agarwal, Kwonjoon Lee, Chen Sun

We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal.

Action Anticipation counterfactual +1

Does Visual Pretraining Help End-to-End Reasoning?

no code implementations NeurIPS 2023 Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid

A positive result would refute the common belief that explicit visual abstraction (e.g. object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.

Image Classification Object +3

NLOS Dies Twice: Challenges and Solutions of V2X for Cooperative Perception

no code implementations 13 Jul 2023 Lantao Li, Chen Sun

Multi-agent multi-lidar sensor fusion between connected vehicles for cooperative perception has recently been recognized as the best technique for minimizing the blind zone of individual vehicular perception systems and further enhancing the overall safety of autonomous driving systems.

Autonomous Driving Sensor Fusion

Federated Learning over a Wireless Network: Distributed User Selection through Random Access

no code implementations 7 Jul 2023 Chen Sun, Shiyao Ma, Ce Zheng, Songtao Wu, Tao Cui, Lingjuan Lyu

This study proposes a network intrinsic approach of distributed user selection that leverages the radio resource competition mechanism in random access.

Fairness Federated Learning

How can objects help action recognition?

1 code implementation CVPR 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.

Action Recognition Object

Dense Video Object Captioning from Disjoint Supervision

1 code implementation 20 Jun 2023 Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid

We propose a new task and model for dense video object captioning -- detecting, tracking and captioning trajectories of objects in a video.

Object Sentence +2

Segment Any Anomaly without Training via Hybrid Prompt Regularization

2 code implementations 18 May 2023 Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Zongwei Du, Liang Gao, Weiming Shen

We present a novel framework, i.e., Segment Any Anomaly + (SAA+), for zero-shot anomaly segmentation with hybrid prompt regularization to improve the adaptability of modern foundation models.

Anomaly Detection Anomaly Localization +3

Comparing Trajectory and Vision Modalities for Verb Representation

no code implementations 8 Mar 2023 Dylan Ebert, Chen Sun, Ellie Pavlick

Given the importance of 3D space in formal models of verb semantics, we expect that these 2D images would result in impoverished representations that fail to capture nuanced differences in meaning.

Representation Learning

Steerable Equivariant Representation Learning

no code implementations 22 Feb 2023 Sangnie Bhardwaj, Willie McClinton, Tongzhou Wang, Guillaume Lajoie, Chen Sun, Phillip Isola, Dilip Krishnan

In this paper, we propose a method of learning representations that are instead equivariant to data augmentations.

Image Retrieval object-detection +5

DEJA VU: Continual Model Generalization For Unseen Domains

2 code implementations 25 Jan 2023 Chenxi Liu, Lixu Wang, Lingjuan Lyu, Chen Sun, Xiao Wang, Qi Zhu

To overcome these limitations of DA and DG in handling the Unfamiliar Period during continual domain shift, we propose RaTP, a framework that focuses on improving models' target domain generalization (TDG) capability, while also achieving effective target domain adaptation (TDA) capability right after training on certain domains and forgetting alleviation (FA) capability on past domains.

Data Augmentation Domain Generalization +1

Contrastive Retrospection: honing in on critical steps for rapid learning and generalization in RL

1 code implementation NeurIPS 2023 Chen Sun, Wannan Yang, Thomas Jiralerspong, Dane Malenfant, Benjamin Alsbury-Nealy, Yoshua Bengio, Blake Richards

Distinct from other contemporary RL approaches to credit assignment, ConSpec takes advantage of the fact that it is easier to retrospectively identify the small set of steps that success is contingent upon (ignoring other states) than it is to prospectively predict reward at every step taken.

Contrastive Learning Out-of-Distribution Generalization +1

A New Knowledge Distillation Network for Incremental Few-Shot Surface Defect Detection

1 code implementation 1 Sep 2022 Chen Sun, Liang Gao, Xinyu Li, Yiping Gao

The proposed DKAN method follows a pretraining-finetuning transfer learning paradigm and a knowledge distillation framework is designed for fine-tuning.

Defect Detection Knowledge Distillation +1

Do Trajectories Encode Verb Meaning?

no code implementations NAACL 2022 Dylan Ebert, Chen Sun, Ellie Pavlick

Distributional models learn representations of words from text, but are criticized for their lack of grounding, or the linking of text to the non-linguistic world.

Representation Learning

AVATAR: Unconstrained Audiovisual Speech Recognition

1 code implementation 15 Jun 2022 Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +1

Learning Audio-Video Modalities from Image Captions

no code implementations 1 Apr 2022 Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid

To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.

Image Captioning Retrieval +4

Do Vision-Language Pretrained Models Learn Composable Primitive Concepts?

1 code implementation 31 Mar 2022 Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun

CompMap first asks a VL model to generate primitive concept activations with text prompts, and then learns to construct a composition model that maps the primitive concept activations (e.g. the likelihood of black tail or red wing) to composite concepts (e.g. a red-winged blackbird).

Fine-Grained Visual Recognition Multimodal Reasoning +1
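
The composition model in the snippet maps primitive-concept activations to a composite-concept score. A minimal linear sketch (the weights and function name here are illustrative; CompMap learns its composition model from data):

```python
def composition_score(activations, weights, bias=0.0):
    # Linear composition: combine primitive-concept activations
    # (e.g. likelihoods of "black tail", "red wing") into a score
    # for a composite concept (e.g. "red-winged blackbird").
    return bias + sum(a * w for a, w in zip(activations, weights))
```

An image with strong primitive activations scores higher for the composite concept than one with weak activations.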

Trajectory balance: Improved credit assignment in GFlowNets

4 code implementations 31 Jan 2022 Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, Yoshua Bengio

Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object.

Diversity
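
The trajectory balance objective trains a GFlowNet so that, along any trajectory, the estimated log-partition function plus the forward log-probabilities matches the log-reward plus the backward log-probabilities. A minimal sketch of the squared discrepancy for one trajectory (variable names are ours):

```python
def trajectory_balance_loss(log_z, log_pf, log_pb, log_reward):
    # Squared trajectory-balance discrepancy for a single trajectory.
    # log_z: scalar estimate of the log partition function.
    # log_pf / log_pb: per-step forward / backward policy log-probabilities.
    # log_reward: log of the unnormalized reward of the terminal object.
    delta = log_z + sum(log_pf) - log_reward - sum(log_pb)
    return delta ** 2
```

In training, this loss is minimized jointly over the policy parameters and `log_z`; it is zero exactly when the flow-matching condition holds on the trajectory.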

Multiview Transformers for Video Recognition

1 code implementation CVPR 2022 Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid

Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.

Ranked #5 on Action Classification on MiT (using extra training data)

Action Classification Action Recognition +1

Masking Modalities for Cross-modal Video Retrieval

no code implementations 1 Nov 2021 Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid

Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.

Retrieval Video Retrieval

Does Vision-and-Language Pretraining Improve Lexical Grounding?

1 code implementation Findings (EMNLP) 2021 Tian Yun, Chen Sun, Ellie Pavlick

Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world.

Question Answering Visual Question Answering

DenseTNT: End-to-end Trajectory Prediction from Dense Goal Sets

2 code implementations ICCV 2021 Junru Gu, Chen Sun, Hang Zhao

In this work, we propose an anchor-free and end-to-end trajectory prediction model, named DenseTNT, that directly outputs a set of trajectories from dense goal candidates.

Motion Forecasting motion prediction +2

Discrete-Valued Neural Communication

no code implementations NeurIPS 2021 Dianbo Liu, Alex Lamb, Kenji Kawaguchi, Anirudh Goyal, Chen Sun, Michael Curtis Mozer, Yoshua Bengio

Deep learning has advanced from fully connected architectures to structured models organized into components, e.g., the transformer composed of positional elements, modular architectures divided into slots, and graph neural nets made up of nodes.

Quantization Systematic Generalization
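
Discretizing the messages passed between components is typically done by snapping each continuous vector to its nearest codebook entry. A minimal sketch of that quantization step (the toy codebook and function name are our assumptions, not the paper's learned codebook):

```python
def quantize(vector, codebook):
    # Map a continuous message to its nearest codebook entry (squared L2),
    # the core operation in discrete-valued communication between modules.
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(codebook, key=lambda c: dist2(vector, c))
```

Downstream components then receive one of finitely many messages, which is what yields the systematic-generalization benefits the paper studies.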

Attention Bottlenecks for Multimodal Fusion

1 code implementation NeurIPS 2021 Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun

Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.

Action Classification Action Recognition +2

Episodic Transformer for Vision-and-Language Navigation

1 code implementation ICCV 2021 Alexander Pashevich, Cordelia Schmid, Chen Sun

We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.

Vision and Language Navigation

Unified Graph Structured Models for Video Understanding

no code implementations ICCV 2021 Anurag Arnab, Chen Sun, Cordelia Schmid

Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.

Action Detection Graph Classification +4

ViViT: A Video Vision Transformer

10 code implementations ICCV 2021 Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid

We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.

Ranked #8 on Action Classification on MiT (Top 5 Accuracy metric, using extra training data)

Action Classification Action Recognition +4

Learning Temporal Dynamics from Cycles in Narrated Video

no code implementations ICCV 2021 Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun

Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.

Multi-modal Transformer for Video Retrieval

1 code implementation ECCV 2020 Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid

In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.

Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT (text-to-video Mean Rank metric, using extra training data)

Natural Language Queries Retrieval +2

What Makes for Good Views for Contrastive Learning?

1 code implementation NeurIPS 2020 Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola

Contrastive learning between multiple views of the data has recently achieved state-of-the-art performance in the field of self-supervised representation learning.

Contrastive Learning Data Augmentation +8
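
Contrastive learning between views is usually trained with an InfoNCE-style objective: pull the positive view toward the anchor, push negatives away. A minimal scalar sketch for one anchor (precomputed similarities and the temperature value are illustrative assumptions):

```python
import math

def info_nce(sim_pos, sim_negs, tau=0.1):
    # InfoNCE loss for one anchor given a positive-view similarity and a
    # list of negative similarities; tau is the temperature.
    logits = [sim_pos / tau] + [s / tau for s in sim_negs]
    z = sum(math.exp(l) for l in logits)
    return -math.log(math.exp(sim_pos / tau) / z)
```

The loss decreases as the positive pair grows more similar relative to the negatives, which is exactly the pressure that shapes which views are "good".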

VectorNet: Encoding HD Maps and Agent Dynamics from Vectorized Representation

4 code implementations CVPR 2020 Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid

Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g. pedestrians and vehicles) and road context information (e.g. lanes, traffic lights).

Graph Neural Network Self-Driving Cars

Speech2Action: Cross-modal Supervision for Action Recognition

no code implementations CVPR 2020 Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman

We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.

Action Recognition

Automated Pyramid Summarization Evaluation

1 code implementation CONLL 2019 Yanjun Gao, Chen Sun, Rebecca J. Passonneau

Pyramid evaluation was developed to assess the content of paragraph length summaries of source texts.

Learning Video Representations using Contrastive Bidirectional Transformer

no code implementations 13 Jun 2019 Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid

This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.

Automatic Speech Recognition Automatic Speech Recognition (ASR) +5

Predicting the Present and Future States of Multi-agent Systems from Partially-observed Visual Data

no code implementations ICLR 2019 Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, Kevin Murphy

We present a method which learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents.

Modeling Parts, Structure, and System Dynamics via Predictive Learning

no code implementations ICLR 2019 Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu

Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future.

Object

Affordance Learning In Direct Perception for Autonomous Driving

no code implementations 20 Mar 2019 Chen Sun, Jean M. Uwabeza Vianney, Dongpu Cao

Our results indicate that this method could act as a cheaper way for training data collection in autonomous driving.

Attribute Autonomous Driving +1

Unsupervised Discovery of Parts, Structure, and Dynamics

no code implementations 12 Mar 2019 Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu

Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future.

Object

Stochastic Prediction of Multi-Agent Interactions from Partial Observations

no code implementations 25 Feb 2019 Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, Kevin Murphy

We present a method that learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents.

D3D: Distilled 3D Networks for Video Action Recognition

1 code implementation 19 Dec 2018 Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar

State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input.

Action Classification Action Recognition +2

Composing Text and Image for Image Retrieval - An Empirical Odyssey

4 code implementations CVPR 2019 Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays

In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image.

Image Retrieval Image Retrieval with Multi-Modal Query +1

Actor-Centric Relation Network

1 code implementation ECCV 2018 Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid

A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.

Action Classification Action Detection +5

Large Scale Fine-Grained Categorization and Domain-Specific Transfer Learning

1 code implementation CVPR 2018 Yin Cui, Yang song, Chen Sun, Andrew Howard, Serge Belongie

We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure.

Fine-Grained Image Classification Fine-Grained Visual Categorization +1
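
The snippet's domain-similarity measure is an Earth Mover's Distance between feature distributions, with a smaller distance indicating a better pre-training source. A minimal 1-D sketch using equal-weight samples matched in sorted order (the scalar features are toy stand-ins for the paper's class-level deep features):

```python
def emd_1d(a, b):
    # 1-D Earth Mover's Distance between two equal-size, equal-weight
    # samples: the optimal transport plan matches sorted values pairwise.
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

source = [0.1, 0.4, 0.5, 0.9]       # hypothetical source-domain features
near_target = [0.15, 0.35, 0.55, 0.85]
far_target = [2.0, 2.5, 3.0, 3.5]
# emd_1d(source, near_target) < emd_1d(source, far_target):
# the nearer distribution is the more promising transfer source.
```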

Rethinking Spatiotemporal Feature Learning: Speed-Accuracy Trade-offs in Video Classification

2 code implementations ECCV 2018 Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy

Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification.

Ranked #29 on Action Recognition on UCF101 (using extra training data)

Action Classification Action Detection +6

VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

1 code implementation ICCV 2017 Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, Boqing Gong

Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understanding of the same visual scenes, and even the same set of images (e.g., of COCO).

Language Modeling Language Modelling +5

The iNaturalist Species Classification and Detection Dataset

19 code implementations CVPR 2018 Grant Van Horn, Oisin Mac Aodha, Yang song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie

Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories.

General Classification Image Classification

AVA: A Video Dataset of Spatio-temporally Localized Atomic Visual Actions

9 code implementations CVPR 2018 Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik

The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.

Action Detection Action Recognition +3

TALL: Temporal Activity Localization via Language Query

12 code implementations ICCV 2017 Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia

For evaluation, we adopt the TaCoS dataset, and build a new dataset for this task on top of Charades by adding sentence temporal annotations, called Charades-STA.

Natural Language Queries regression +2

TURN TAP: Temporal Unit Regression Network for Temporal Action Proposals

1 code implementation ICCV 2017 Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, Ram Nevatia

Temporal Action Proposal (TAP) generation is an important problem, as fast and accurate extraction of semantically important (e.g. human actions) segments from untrimmed videos is an important step for large-scale video analysis.

regression Temporal Action Localization

Complex Event Recognition from Images with Few Training Examples

no code implementations 17 Jan 2017 Unaiza Ahsan, Chen Sun, James Hays, Irfan Essa

We propose to leverage concept-level representations for complex event recognition in photographs given limited training examples.

Speed/accuracy trade-offs for modern convolutional object detectors

14 code implementations CVPR 2017 Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang song, Sergio Guadarrama, Kevin Murphy

On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.

Ranked #226 on Object Detection on COCO test-dev (using extra training data)

Object object-detection +1

ACD: Action Concept Discovery from Image-Sentence Corpora

no code implementations 16 Apr 2016 Jiyang Gao, Chen Sun, Ram Nevatia

It obtains candidate action concepts by extracting verb-object pairs from sentences and verifies their visualness with the associated images.

Action Classification Classification +2

Temporal Localization of Fine-Grained Actions in Videos by Domain Transfer from Web Images

1 code implementation 4 Apr 2015 Chen Sun, Sanketh Shetty, Rahul Sukthankar, Ram Nevatia

To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output.

Action Recognition Temporal Action Localization +1

DISCOVER: Discovering Important Segments for Classification of Video Events and Recounting

no code implementations CVPR 2014 Chen Sun, Ram Nevatia

Our goal is to find the important segments and capture their information for event classification and recounting.

General Classification
