2 code implementations • 9 Aug 2023 • Hui Zeng, Jingyuan Xue, Meng Hao, Chen Sun, Bin Ning, Na Zhang
This paper presents CG-Eval, the first comprehensive evaluation of the generation capabilities of large Chinese language models across a wide range of academic disciplines.
no code implementations • 31 Jul 2023 • Qi Zhao, Ce Zhang, Shijie Wang, Changcheng Fu, Nakul Agarwal, Kwonjoon Lee, Chen Sun
We propose to formulate the LTA task from two perspectives: a bottom-up approach that predicts the next actions autoregressively by modeling temporal dynamics; and a top-down approach that infers the goal of the actor and plans the needed procedure to accomplish the goal.
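As a loose illustration of the bottom-up view, a causal transformer can score the next action autoregressively from the observed action sequence. This is a generic PyTorch sketch with assumed vocabulary size, widths, and depth, not the paper's architecture:

```python
import torch
import torch.nn as nn

# Illustrative sizes; the action vocabulary and model widths are assumptions.
n_actions, dim = 100, 256
embed = nn.Embedding(n_actions, dim)
layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
backbone = nn.TransformerEncoder(layer, num_layers=2)
head = nn.Linear(dim, n_actions)

observed = torch.randint(0, n_actions, (1, 8))          # past action tokens
causal_mask = nn.Transformer.generate_square_subsequent_mask(8)
hidden = backbone(embed(observed), mask=causal_mask)    # temporal dynamics
next_action_logits = head(hidden[:, -1])                # score the next action
```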
no code implementations • 17 Jul 2023 • Chen Sun, Calvin Luo, Xingyi Zhou, Anurag Arnab, Cordelia Schmid
A positive result would refute the common belief that explicit visual abstraction (e.g., object detection) is essential for compositional generalization on visual reasoning, and confirm the feasibility of a neural network "generalist" to solve visual recognition and reasoning tasks.
no code implementations • 13 Jul 2023 • Lantao Li, Chen Sun
Multi-agent multi-lidar sensor fusion between connected vehicles for cooperative perception has recently been recognized as the best technique for minimizing the blind zone of individual vehicular perception systems and further enhancing the overall safety of autonomous driving systems.
no code implementations • 7 Jul 2023 • Chen Sun, Shiyao Ma, Ce Zheng, Songtao Wu, Tao Cui, Lingjuan Lyu
This study proposes a network-intrinsic approach to distributed user selection that leverages the radio resource competition mechanism in random access.
no code implementations • 7 Jul 2023 • Zilai Zeng, Ce Zhang, Shijie Wang, Chen Sun
Recent work has demonstrated the effectiveness of formulating decision making as a supervised learning problem on offline-collected trajectories.
1 code implementation • CVPR 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy.
1 code implementation • 20 Jun 2023 • Xingyi Zhou, Anurag Arnab, Chen Sun, Cordelia Schmid
We show our task is more general than grounding, and models trained on our task can directly be applied to grounding by finding the bounding box with the maximum likelihood of generating the query sentence.
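To make the grounding-by-likelihood idea concrete, a minimal sketch follows: each candidate box is scored by the log-likelihood its caption decoder assigns to the query sentence, and the highest-scoring box wins. The `caption_logits` interface is hypothetical, standing in for whatever per-box decoder the model exposes:

```python
import torch

@torch.no_grad()
def ground_by_caption_likelihood(model, boxes, query_ids):
    """Pick the box whose captioner assigns the query sentence the highest
    log-likelihood. `model.caption_logits(box)` is a hypothetical interface
    returning (sentence_length, vocab_size) next-token logits for one box."""
    scores = []
    for box in boxes:
        logits = model.caption_logits(box)                  # (L, vocab)
        logp = logits.log_softmax(dim=-1)
        token_logp = logp[torch.arange(len(query_ids)), query_ids]
        scores.append(token_logp.sum())                     # sequence log-likelihood
    return boxes[torch.stack(scores).argmax().item()]
```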
no code implementations • 15 Jun 2023 • Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Liang Gao, Weiming Shen
This technical report introduces the winning solution of the team Segment Any Anomaly for the CVPR2023 Visual Anomaly and Novelty Detection (VAND) challenge.
no code implementations • 13 Jun 2023 • Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David A Ross, Cordelia Schmid, Alireza Fathi
Second, we use examples of user decision-making to provide our LLM-powered planner and reasoner with relevant contextual instances, enhancing their capacity to make informed decisions.
2 code implementations • 18 May 2023 • Yunkang Cao, Xiaohao Xu, Chen Sun, Yuqi Cheng, Zongwei Du, Liang Gao, Weiming Shen
We present a novel framework, i.e., Segment Any Anomaly+ (SAA+), for zero-shot anomaly segmentation with hybrid prompt regularization to improve the adaptability of modern foundation models.
Ranked #1 on Anomaly Detection on KSDD2
no code implementations • 24 Apr 2023 • Alexey Gritsenko, Xuehan Xiong, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid, Anurag Arnab
The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks.
Ranked #1 on Action Recognition on AVA v2.1 (using extra training data)
no code implementations • 8 Mar 2023 • Dylan Ebert, Chen Sun, Ellie Pavlick
Given the importance of 3D space in formal models of verb semantics, we expect that these 2D images would result in impoverished representations that fail to capture nuanced differences in meaning.
no code implementations • 22 Feb 2023 • Sangnie Bhardwaj, Willie McClinton, Tongzhou Wang, Guillaume Lajoie, Chen Sun, Phillip Isola, Dilip Krishnan
In this paper, we propose a method of learning representations that are instead equivariant to data augmentations.
2 code implementations • 25 Jan 2023 • Chenxi Liu, Lixu Wang, Lingjuan Lyu, Chen Sun, Xiao Wang, Qi Zhu
To overcome these limitations of DA and DG in handling the Unfamiliar Period during continual domain shift, we propose RaTP, a framework that focuses on improving a model's target domain generalization (TDG) capability, while also achieving effective target domain adaptation (TDA) immediately after training on certain domains and forgetting alleviation (FA) on past domains.
1 code implementation • CVPR 2023 • Ziniu Hu, Ahmet Iscen, Chen Sun, ZiRui Wang, Kai-Wei Chang, Yizhou Sun, Cordelia Schmid, David A. Ross, Alireza Fathi
REVEAL consists of four key components: the memory, the encoder, the retriever and the generator.
Ranked #1 on Visual Question Answering (VQA) on A-OKVQA (Accuracy metric)
1 code implementation • 12 Oct 2022 • Chen Sun, Wannan Yang, Thomas Jiralerspong, Dane Malenfant, Benjamin Alsbury-Nealy, Yoshua Bengio, Blake Richards
These critical steps are challenging to identify with traditional reinforcement learning (RL) methods that rely on the Bellman equation for credit assignment.
1 code implementation • 1 Sep 2022 • Chen Sun, Liang Gao, Xinyu Li, Yiping Gao
The proposed DKAN method follows a pretraining-finetuning transfer learning paradigm, with a knowledge distillation framework designed for the fine-tuning stage.
no code implementations • 14 Aug 2022 • Medhini Narasimhan, Arsha Nagrani, Chen Sun, Michael Rubinstein, Trevor Darrell, Anna Rohrbach, Cordelia Schmid
In this work, we focus on summarizing instructional videos, an under-explored area of video summarization.
no code implementations • 8 Jul 2022 • Anurag Arnab, Xuehan Xiong, Alexey Gritsenko, Rob Romijnders, Josip Djolonga, Mostafa Dehghani, Chen Sun, Mario Lučić, Cordelia Schmid
Transfer learning is the predominant paradigm for training deep networks on small target datasets.
no code implementations • NAACL 2022 • Dylan Ebert, Chen Sun, Ellie Pavlick
Distributional models learn representations of words from text, but are criticized for their lack of grounding, or the linking of text to the non-linguistic world.
1 code implementation • 15 Jun 2022 • Valentin Gabeur, Paul Hongsuck Seo, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Audio-visual automatic speech recognition (AV-ASR) is an extension of ASR that incorporates visual cues, often from the movements of a speaker's mouth.
no code implementations • 1 Apr 2022 • Arsha Nagrani, Paul Hongsuck Seo, Bryan Seybold, Anja Hauth, Santiago Manen, Chen Sun, Cordelia Schmid
To close this gap we propose a new video mining pipeline which involves transferring captions from image captioning datasets to video clips with no additional manual effort.
Ranked #19 on Zero-Shot Video Retrieval on MSR-VTT (using extra training data)
no code implementations • 31 Mar 2022 • Tian Yun, Usha Bhalla, Ellie Pavlick, Chen Sun
CompMap first asks a VL model to generate primitive concept activations with text prompts, and then learns to construct a composition model that maps the primitive concept activations (e.g., the likelihood of a black tail or a red wing) to composite concepts (e.g., a red-winged blackbird).
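A minimal sketch of the composition step, assuming the primitive concept activations have already been produced by a VL model from text prompts; the linear form and dimensions here are illustrative:

```python
import torch
import torch.nn as nn

# Primitive activations (e.g. similarities to prompts like "a photo of a
# red wing") are mapped to composite classes by a learned linear model.
n_primitives, n_composites = 200, 50
composition = nn.Linear(n_primitives, n_composites)

primitive_acts = torch.rand(8, n_primitives)    # from VL-model text prompts
composite_logits = composition(primitive_acts)  # e.g. "red-winged blackbird"
```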
2 code implementations • 31 Jan 2022 • Nikolay Malkin, Moksh Jain, Emmanuel Bengio, Chen Sun, Yoshua Bengio
Generative flow networks (GFlowNets) are a method for learning a stochastic policy for generating compositional objects, such as graphs or strings, from a given unnormalized density by sequences of actions, where many possible action sequences may lead to the same object.
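For reference, a compact sketch of the trajectory balance objective proposed in this paper, assuming the forward and backward policy log-probabilities and terminal log-rewards have already been gathered for a batch of complete trajectories:

```python
import torch

def trajectory_balance_loss(log_Z, log_pf, log_pb, log_reward):
    """Squared gap between the forward and backward views of each trajectory:
    (log Z + sum log P_F) vs. (log R(x) + sum log P_B).

    log_Z:      scalar parameter, log of the estimated partition function
    log_pf:     (batch, T) forward-policy log-probs along each trajectory
    log_pb:     (batch, T) backward-policy log-probs along each trajectory
    log_reward: (batch,) log R(x) of the terminal objects
    """
    forward = log_Z + log_pf.sum(dim=1)
    backward = log_reward + log_pb.sum(dim=1)
    return ((forward - backward) ** 2).mean()
```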
1 code implementation • CVPR 2022 • Shen Yan, Xuehan Xiong, Anurag Arnab, Zhichao Lu, Mi Zhang, Chen Sun, Cordelia Schmid
Video understanding requires reasoning at multiple spatiotemporal resolutions -- from short fine-grained motions to events taking place over longer durations.
Ranked #4 on Action Recognition on EPIC-KITCHENS-100 (using extra training data)
no code implementations • 1 Nov 2021 • Valentin Gabeur, Arsha Nagrani, Chen Sun, Karteek Alahari, Cordelia Schmid
Our proposal is to pre-train a video encoder using all the available video modalities as supervision, namely, appearance, sound, and transcribed speech.
1 code implementation • Findings (EMNLP) 2021 • Tian Yun, Chen Sun, Ellie Pavlick
Linguistic representations derived from text alone have been criticized for their lack of grounding, i.e., connecting words to their meanings in the physical world.
2 code implementations • ICCV 2021 • Junru Gu, Chen Sun, Hang Zhao
In this work, we propose an anchor-free and end-to-end trajectory prediction model, named DenseTNT, that directly outputs a set of trajectories from dense goal candidates.
no code implementations • NeurIPS 2021 • Dianbo Liu, Alex Lamb, Kenji Kawaguchi, Anirudh Goyal, Chen Sun, Michael Curtis Mozer, Yoshua Bengio
Deep learning has advanced from fully connected architectures to structured models organized into components, e.g., the transformer composed of positional elements, modular architectures divided into slots, and graph neural nets made up of nodes.
1 code implementation • NeurIPS 2021 • Arsha Nagrani, Shan Yang, Anurag Arnab, Aren Jansen, Cordelia Schmid, Chen Sun
Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio.
Ranked #1 on Action Classification on Kinetics-Sounds
no code implementations • CVPR 2021 • Lu Mi, Hang Zhao, Charlie Nash, Xiaohan Jin, Jiyang Gao, Chen Sun, Cordelia Schmid, Nir Shavit, Yuning Chai, Dragomir Anguelov
To address this issue, we introduce a new challenging task to generate HD maps.
1 code implementation • ICCV 2021 • Alexander Pashevich, Cordelia Schmid, Chen Sun
We demonstrate that encoding the history with a transformer is critical to solve compositional tasks, and that pretraining and joint training with synthetic instructions further improve the performance.
1 code implementation • 6 Apr 2021 • Jack Valmadre, Alex Bewley, Jonathan Huang, Chen Sun, Cristian Sminchisescu, Cordelia Schmid
This paper introduces temporally local metrics for Multi-Object Tracking.
no code implementations • ICCV 2021 • Chen Sun, Arsha Nagrani, Yonglong Tian, Cordelia Schmid
We focus on contrastive methods for self-supervised video representation learning.
no code implementations • ICCV 2021 • Anurag Arnab, Chen Sun, Cordelia Schmid
Accurate video understanding involves reasoning about the relationships between actors, objects and their environment, often over long temporal intervals.
6 code implementations • ICCV 2021 • Anurag Arnab, Mostafa Dehghani, Georg Heigold, Chen Sun, Mario Lučić, Cordelia Schmid
We present pure-transformer based models for video classification, drawing upon the recent success of such models in image classification.
Ranked #8 on Action Classification on Moments in Time (Top 5 Accuracy metric, using extra training data)
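A rough sketch of the tokenization such pure-transformer video models start from: non-overlapping spatio-temporal tubelets embedded by a strided 3D convolution. Sizes are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class TubeletEmbed(nn.Module):
    """Minimal sketch of tubelet embedding for a video transformer."""
    def __init__(self, dim=768, t=2, h=16, w=16):
        super().__init__()
        # A 3D convolution with kernel == stride extracts non-overlapping
        # spatio-temporal "tubelets" and projects them to token embeddings.
        self.proj = nn.Conv3d(3, dim, kernel_size=(t, h, w), stride=(t, h, w))

    def forward(self, video):                      # (B, 3, T, H, W)
        tokens = self.proj(video)                  # (B, dim, T//t, H//h, W//w)
        return tokens.flatten(2).transpose(1, 2)   # (B, num_tokens, dim)

# The token sequence then feeds a standard transformer encoder for
# classification, with positional embeddings and a class token added.
```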
no code implementations • ICCV 2021 • Dave Epstein, Jiajun Wu, Cordelia Schmid, Chen Sun
Learning to model how the world changes as time elapses has proven a challenging problem for the computer vision community.
4 code implementations • 19 Aug 2020 • Hang Zhao, Jiyang Gao, Tian Lan, Chen Sun, Benjamin Sapp, Balakrishnan Varadarajan, Yue Shen, Yi Shen, Yuning Chai, Cordelia Schmid, Cong-Cong Li, Dragomir Anguelov
Our key insight is that for prediction within a moderate time horizon, the future modes can be effectively captured by a set of target states.
1 code implementation • 3 Aug 2020 • Samuel Albanie, Yang Liu, Arsha Nagrani, Antoine Miech, Ernesto Coto, Ivan Laptev, Rahul Sukthankar, Bernard Ghanem, Andrew Zisserman, Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid, Shi-Zhe Chen, Yida Zhao, Qin Jin, Kaixu Cui, Hui Liu, Chen Wang, Yudong Jiang, Xiaoshuai Hao
This report summarizes the results of the first edition of the challenge together with the findings of the participants.
no code implementations • 29 Jul 2020 • Jonathan C. Stroud, Zhichao Lu, Chen Sun, Jia Deng, Rahul Sukthankar, Cordelia Schmid, David A. Ross
Based on this observation, we propose to use text as a method for learning video representations.
1 code implementation • ECCV 2020 • Valentin Gabeur, Chen Sun, Karteek Alahari, Cordelia Schmid
In this paper, we present a multi-modal transformer to jointly encode the different modalities in video, which allows each of them to attend to the others.
Ranked #1 on Zero-Shot Video Retrieval on MSR-VTT (text-to-video Mean Rank metric, using extra training data)
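One minimal way to realize joint encoding in this spirit: tag each modality's features with a learned modality embedding, concatenate, and run a single self-attention encoder so every token can attend across modalities. Module names and dimensions below are illustrative, not the paper's exact design:

```python
import torch
import torch.nn as nn

class JointModalityEncoder(nn.Module):
    """Sketch: one transformer over concatenated modality tokens, letting
    appearance, audio, and speech features attend to one another."""
    def __init__(self, dim=512, n_modalities=3, layers=4):
        super().__init__()
        self.modality_embed = nn.Embedding(n_modalities, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, feats):  # list of (B, N_i, dim) tensors, one per modality
        tagged = [f + self.modality_embed.weight[i] for i, f in enumerate(feats)]
        tokens = torch.cat(tagged, dim=1)    # (B, sum N_i, dim)
        return self.encoder(tokens)          # every token attends across modalities
```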
no code implementations • ECCV 2020 • Anurag Arnab, Chen Sun, Arsha Nagrani, Cordelia Schmid
Despite the recent advances in video classification, progress in spatio-temporal action recognition has lagged behind.
1 code implementation • NeurIPS 2020 • Yonglong Tian, Chen Sun, Ben Poole, Dilip Krishnan, Cordelia Schmid, Phillip Isola
Contrastive learning between multiple views of the data has recently achieved state of the art performance in the field of self-supervised representation learning.
Ranked #2 on Contrastive Learning on imagenet-1k
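For reference, a compact sketch of the standard two-view contrastive (InfoNCE-style) objective; this is the generic form, not the paper's exact multi-view formulation:

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.07):
    """Contrastive (InfoNCE) loss between two views of the same batch.
    z1[i] and z2[i] are embeddings of two augmented views of example i."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature           # (B, B) similarity matrix
    labels = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, labels)       # positives sit on the diagonal
```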
3 code implementations • CVPR 2020 • Jiyang Gao, Chen Sun, Hang Zhao, Yi Shen, Dragomir Anguelov, Cong-Cong Li, Cordelia Schmid
Behavior prediction in dynamic, multi-agent systems is an important problem in the context of self-driving cars, due to the complex representations and interactions of road components, including moving agents (e.g., pedestrians and vehicles) and road context information (e.g., lanes, traffic lights).
no code implementations • CVPR 2020 • Arsha Nagrani, Chen Sun, David Ross, Rahul Sukthankar, Cordelia Schmid, Andrew Zisserman
We train a BERT-based Speech2Action classifier on over a thousand movie screenplays, to predict action labels from transcribed speech segments.
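A minimal sketch of such a text-to-action classifier built from off-the-shelf components; the checkpoint, label count, and example sentence are placeholders, not the authors' released model:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder checkpoint and label set: one logit per action-verb class.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=18)

inputs = tokenizer("I'll call you when I land.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
predicted_action = logits.argmax(-1)   # index into the action vocabulary
```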
1 code implementation • CONLL 2019 • Yanjun Gao, Chen Sun, Rebecca J. Passonneau
Pyramid evaluation was developed to assess the content of paragraph length summaries of source texts.
1 code implementation • NeurIPS 2019 • Matthias Minderer, Chen Sun, Ruben Villegas, Forrester Cole, Kevin Murphy, Honglak Lee
Extracting and predicting object structure and dynamics from videos without supervision is a major challenge in machine learning.
Ranked #11 on Video Prediction on KTH
no code implementations • 13 Jun 2019 • Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid
This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods.
no code implementations • ICLR 2019 • Zhenjia Xu, Zhijian Liu, Chen Sun, Kevin Murphy, William T. Freeman, Joshua B. Tenenbaum, Jiajun Wu
Humans easily recognize object parts and their hierarchical structure by watching how they move; they can then predict how each part moves in the future.
no code implementations • ICLR 2019 • Chen Sun, Per Karlsson, Jiajun Wu, Joshua B. Tenenbaum, Kevin Murphy
We present a method which learns to integrate temporal information, from a learned dynamics model, with ambiguous visual information, from a learned vision model, in the context of interacting agents.
no code implementations • CVPR 2019 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Rahul Sukthankar, Kevin Murphy, Cordelia Schmid
This paper focuses on multi-person action forecasting in videos.
3 code implementations • ICCV 2019 • Chen Sun, Austin Myers, Carl Vondrick, Kevin Murphy, Cordelia Schmid
Self-supervised learning has become increasingly important to leverage the abundance of unlabeled data available on platforms like YouTube.
Ranked #1 on Action Classification on YouCook2
no code implementations • 20 Mar 2019 • Chen Sun, Jean M. Uwabeza Vianney, Dongpu Cao
Our results indicate that this method could serve as a cheaper way to collect training data for autonomous driving.
no code implementations • 13 Feb 2019 • Chen Sun, Ye Tian, Liang Gao, Yishuai Niu, Tianlong Zhang, Hua Li, Yuqing Zhang, Zengqi Yue, Nicole Delepine-Gilon, Jin Yu
Machine learning has been used to develop the model.
1 code implementation • 19 Dec 2018 • Jonathan C. Stroud, David A. Ross, Chen Sun, Jia Deng, Rahul Sukthankar
State-of-the-art methods for video action recognition commonly use an ensemble of two networks: the spatial stream, which takes RGB frames as input, and the temporal stream, which takes optical flow as input.
Ranked #11 on Action Recognition on AVA v2.1
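For context, a sketch of the standard two-stream ensemble that sentence describes, with late fusion by a weighted average of logits (the weighting is an assumption); the paper's own contribution builds on top of this baseline:

```python
import torch

@torch.no_grad()
def two_stream_predict(spatial_net, temporal_net, rgb_clip, flow_clip, w=0.5):
    """Late fusion for a two-stream ensemble: average the streams' logits."""
    spatial_logits = spatial_net(rgb_clip)      # RGB frames
    temporal_logits = temporal_net(flow_clip)   # stacked optical flow
    return w * spatial_logits + (1 - w) * temporal_logits
```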
4 code implementations • CVPR 2019 • Nam Vo, Lu Jiang, Chen Sun, Kevin Murphy, Li-Jia Li, Li Fei-Fei, James Hays
In this paper, we study the task of image retrieval, where the input query is specified in the form of an image plus some text that describes desired modifications to the input image.
Ranked #2 on Image Retrieval with Multi-Modal Query on MIT-States
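One way to compose such a query, sketched in the spirit of the paper's gated-residual idea: a gate keeps the image features the text says to preserve, and a residual adds what the text says to change. Names and sizes are assumptions:

```python
import torch
import torch.nn as nn

class ComposeQuery(nn.Module):
    """Gated-residual composition of image and text query features."""
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.residual = nn.Linear(2 * dim, dim)

    def forward(self, img_feat, txt_feat):          # both (B, dim)
        joint = torch.cat([img_feat, txt_feat], dim=1)
        # Keep what the text says to keep, add what it says to change.
        return self.gate(joint) * img_feat + self.residual(joint)
```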
1 code implementation • ECCV 2018 • Chen Sun, Abhinav Shrivastava, Carl Vondrick, Kevin Murphy, Rahul Sukthankar, Cordelia Schmid
A visualization of the learned relation features confirms that our approach is able to attend to the relevant relations for each action.
Ranked #15 on Action Recognition on AVA v2.1
1 code implementation • CVPR 2018 • Yin Cui, Yang Song, Chen Sun, Andrew Howard, Serge Belongie
We propose a measure to estimate domain similarity via Earth Mover's Distance and demonstrate that transfer learning benefits from pre-training on a source domain that is similar to the target domain by this measure.
Ranked #27 on Fine-Grained Image Classification on CUB-200-2011
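A sketch of EMD-based domain similarity under stated assumptions: each domain is summarized by per-class feature centroids with uniform weights, the cost is Euclidean distance between centroids, and similarity is an exponentiated negative EMD with an illustrative `gamma`. It relies on the POT library's `ot.emd2`:

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def domain_similarity(src_feats, tgt_feats, gamma=0.01):
    """src_feats, tgt_feats: (num_classes, dim) per-class feature centroids.
    Weights are uniform here; class frequencies could be used instead."""
    a = np.ones(len(src_feats)) / len(src_feats)
    b = np.ones(len(tgt_feats)) / len(tgt_feats)
    # Pairwise Euclidean costs between source and target centroids.
    M = np.linalg.norm(src_feats[:, None] - tgt_feats[None, :], axis=-1)
    emd = ot.emd2(a, b, M)           # Earth Mover's Distance
    return np.exp(-gamma * emd)      # higher = more similar domains
```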
no code implementations • 22 Jan 2018 • Unaiza Ahsan, Chen Sun, Irfan Essa
We propose an action recognition framework using Generative Adversarial Networks.
1 code implementation • ECCV 2018 • Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, Kevin Murphy
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification.
Ranked #24 on Action Recognition on UCF101 (using extra training data)
1 code implementation • ICCV 2017 • Chuang Gan, Yandong Li, Haoxiang Li, Chen Sun, Boqing Gong
Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understanding about the same visual scenes, and even the same set of images (e.g., of COCO).
13 code implementations • CVPR 2018 • Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, Serge Belongie
Existing image classification datasets used in computer vision tend to have a uniform distribution of images across object categories.
Ranked #8 on Image Classification on iNaturalist
2 code implementations • ICCV 2017 • Chen Sun, Abhinav Shrivastava, Saurabh Singh, Abhinav Gupta
What will happen if we increase the dataset size by 10x or 100x?
Ranked #2 on Semantic Segmentation on PASCAL VOC 2007
6 code implementations • CVPR 2018 • Chunhui Gu, Chen Sun, David A. Ross, Carl Vondrick, Caroline Pantofaru, Yeqing Li, Sudheendra Vijayanarasimhan, George Toderici, Susanna Ricco, Rahul Sukthankar, Cordelia Schmid, Jitendra Malik
The AVA dataset densely annotates 80 atomic visual actions in 430 15-minute video clips, where actions are localized in space and time, resulting in 1.58M action labels with multiple labels per person occurring frequently.
Ranked #6 on Action Detection on UCF101-24
9 code implementations • ICCV 2017 • Jiyang Gao, Chen Sun, Zhenheng Yang, Ram Nevatia
For evaluation, we adopt the TACoS dataset, and build a new dataset for this task on top of Charades by adding sentence temporal annotations, called Charades-STA.
1 code implementation • ICCV 2017 • Jiyang Gao, Zhenheng Yang, Chen Sun, Kan Chen, Ram Nevatia
Temporal Action Proposal (TAP) generation is an important problem, as fast and accurate extraction of semantically important segments (e.g., human actions) from untrimmed videos is an important step for large-scale video analysis.
Ranked #8 on Action Recognition on THUMOS’14
no code implementations • 17 Jan 2017 • Unaiza Ahsan, Chen Sun, James Hays, Irfan Essa
We propose to leverage concept-level representations for complex event recognition in photographs given limited training examples.
14 code implementations • CVPR 2017 • Jonathan Huang, Vivek Rathod, Chen Sun, Menglong Zhu, Anoop Korattikara, Alireza Fathi, Ian Fischer, Zbigniew Wojna, Yang Song, Sergio Guadarrama, Kevin Murphy
At the opposite end, where accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
Ranked #224 on Object Detection on COCO test-dev
no code implementations • 16 Apr 2016 • Jiyang Gao, Chen Sun, Ram Nevatia
It obtains candidate action concepts by extracting verb-object pairs from sentences and verifies their visualness with the associated images.
no code implementations • CVPR 2016 • Chen Sun, Manohar Paluri, Ronan Collobert, Ram Nevatia, Lubomir Bourdev
This paper aims to classify and locate objects accurately and efficiently, without using bounding box annotations.
Ranked #5 on Weakly Supervised Object Detection on COCO
no code implementations • ICCV 2015 • Chen Sun, Chuang Gan, Ram Nevatia
Humans connect language and vision to perceive the world.
1 code implementation • 4 Apr 2015 • Chen Sun, Sanketh Shetty, Rahul Sukthankar, Ram Nevatia
To solve this problem, we propose a simple yet effective method that takes weak video labels and noisy image labels as input, and generates localized action frames as output.
no code implementations • CVPR 2014 • Chen Sun, Ram Nevatia
Our goal is to find the important segments and capture their information for event classification and recounting.