2 code implementations • 11 May 2022 • Yawei Li, Kai Zhang, Radu Timofte, Luc van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, Jingyi Chen, Qingrui Han, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Yu Qiao, Chao Dong, Long Sun, Jinshan Pan, Yi Zhu, Zhikai Zong, Xiaoxiao Liu, Zheng Hui, Tao Yang, Peiran Ren, Xuansong Xie, Xian-Sheng Hua, Yanbo Wang, Xiaozhong Ji, Chuming Lin, Donghao Luo, Ying Tai, Chengjie Wang, Zhizhong Zhang, Yuan Xie, Shen Cheng, Ziwei Luo, Lei Yu, Zhihong Wen, Qi Wu, Youwei Li, Haoqiang Fan, Jian Sun, Shuaicheng Liu, Yuanfei Huang, Meiguang Jin, Hua Huang, Jing Liu, Xinjian Zhang, Yan Wang, Lingshun Long, Gen Li, Yuanfan Zhang, Zuowei Cao, Lei Sun, Panaetov Alexander, Yucong Wang, Minjie Cai, Li Wang, Lu Tian, Zheyuan Wang, Hongbing Ma, Jie Liu, Chao Chen, Yidong Cai, Jie Tang, Gangshan Wu, Weiran Wang, Shirui Huang, Honglei Lu, Huan Liu, Keyan Wang, Jun Chen, Shi Chen, Yuchun Miao, Zimo Huang, Lefei Zhang, Mustafa Ayazoğlu, Wei Xiong, Chengyi Xiong, Fei Wang, Hao Li, Ruimian Wen, Zhijing Yang, Wenbin Zou, Weixin Zheng, Tian Ye, Yuncheng Zhang, Xiangzhen Kong, Aditya Arora, Syed Waqas Zamir, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Dandan Gao, Dengwen Zhou, Qian Ning, Jingzhu Tang, Han Huang, YuFei Wang, Zhangheng Peng, Haobo Li, Wenxue Guan, Shenghua Gong, Xin Li, Jun Liu, Wanjun Wang, Dengwen Zhou, Kun Zeng, Hanjiang Lin, Xinyu Chen, Jinsheng Fang
The aim was to design a network for single image super-resolution that improved efficiency, as measured by several metrics including runtime, parameters, FLOPs, activations, and memory consumption, while at least maintaining a PSNR of 29.00 dB on the DIV2K validation set.
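Since the 29.00 dB target anchors the challenge, it may help to recall how PSNR is computed from mean squared error. This is a minimal illustrative sketch (pure Python on flat pixel lists, not the challenge's evaluation code):

```python
import math

def psnr(reference, test, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized
    images, given as flat lists of pixel values."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# A constant offset of 9 on 8-bit pixels gives MSE = 81, about 29.05 dB.
a = [0] * 16
b = [9] * 16
print(round(psnr(a, b), 2))  # → 29.05
```

In the challenge this score is computed between each restored output and the DIV2K ground-truth image.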
Turn-taking, aiming to decide when the next speaker can start talking, is an essential component in building human-robot spoken dialogue systems.
Idioms are a kind of idiomatic expression in Chinese, most of which consist of four Chinese characters.
Engineering design of origami systems is challenging because comparing different origami patterns requires using categorical features and evaluating multi-physics behavior targets introduces multi-objective problems.
Conventional 3D object detection approaches concentrate on bounding box representation learning with several parameters, i.e., localization, dimension, and orientation.
Multiple datasets and open challenges for object detection have been introduced in recent years.
Ranked #1 on Object Detection on BigDetection val
Building Spoken Language Understanding (SLU) robust to Automatic Speech Recognition (ASR) errors is an essential issue for various voice-enabled virtual assistants.
In short text, the extremely short length, feature sparsity, and high ambiguity pose huge challenges to classification tasks.
The vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs from sets of negative samples.
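The contrastive idea can be sketched with a minimal InfoNCE-style loss (a generic textbook formulation, not the exact loss of any one paper here): the positive pair's similarity is pushed up relative to a set of negatives.

```python
import math

def info_nce(pos_sim, neg_sims, temperature=0.1):
    """InfoNCE-style contrastive loss: negative log-softmax of the
    positive similarity against the negatives, with a temperature
    scale. Lower loss means the positive pair stands out more."""
    logits = [pos_sim / temperature] + [s / temperature for s in neg_sims]
    m = max(logits)  # subtract the max for numerical stability
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return -(logits[0] - log_denom)

# The loss shrinks as the positive pair becomes more similar
# than the negatives.
easy = info_nce(pos_sim=0.9, neg_sims=[0.1, 0.0, -0.2])
hard = info_nce(pos_sim=0.2, neg_sims=[0.1, 0.0, -0.2])
print(easy < hard)  # → True
```

In practice the similarities are cosine similarities between learned embeddings, and the loss is averaged over a batch.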
Parallel data for sentence simplification (SS) is scarce for neural SS modeling.
Recognizing and localizing objects in the 3D space is a crucial ability for an AI agent to perceive its surrounding environment.
Our formulation is able to capture global context in a video, thus robust to temporal content change.
Semantic segmentation is a challenging problem due to difficulties in modeling context in complex scenes and class confusions along boundaries.
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target by destroying the most instructive information in instructions at different timesteps.
We introduce a system for optimal resource allocation that can predict performance with aggressive trade-offs, for both new and past observed queries.
A navigation agent is supposed to have various intelligent skills, such as visual perception, mapping, planning, exploration, and reasoning.
Neural network-based semantic segmentation has achieved remarkable results when large amounts of annotated data are available, that is, in the supervised case.
Previous works study the adversarial robustness of image classifiers at the image level and use all the pixel information in an image indiscriminately, leaving regions with different semantic meanings in the pixel space of an image unexplored.
To better exploit the intrinsic structure of the target domain, we propose Domain Consensus Clustering (DCC), which exploits the domain consensus knowledge to discover discriminative clusters on both common samples and private ones.
Ranked #1 on Universal Domain Adaptation on Office-Home
We first introduce the vanilla video transformer and show that the transformer module is able to perform spatio-temporal modeling from raw pixels, but with heavy memory usage.
Ranked #11 on Action Classification on Charades
In this task, an agent is required to navigate from an arbitrary position in a 3D embodied environment to localize a target following a scene description.
Numerical simulations first illustrate the consistency of theoretical results on the sharp interface limit.
Numerical Analysis 76Z99, 92B05, 76R50
Weyl points are degenerate points on the spectral bands at which energy bands intersect conically.
Mathematical Physics, Spectral Theory
Semi-supervised learning through deep generative models and multi-lingual pretraining techniques have orchestrated tremendous success across different areas of NLP.
CrossNorm exchanges styles between feature channels to perform style augmentation, diversifying the content and style mixtures.
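Under the common assumption that a channel's "style" is captured by its mean and standard deviation, the exchange in CrossNorm can be sketched for a single feature channel as follows (an illustrative simplification, not the paper's full implementation, which operates on network feature maps):

```python
import math

def channel_stats(channel):
    """Mean and standard deviation of one feature channel (flat list)."""
    mu = sum(channel) / len(channel)
    var = sum((v - mu) ** 2 for v in channel) / len(channel)
    return mu, math.sqrt(var) + 1e-6  # epsilon avoids division by zero

def cross_norm(chan_a, chan_b):
    """Re-style channel A with the statistics of channel B:
    normalize A to zero mean / unit std, then scale and shift
    with B's std and mean."""
    mu_a, std_a = channel_stats(chan_a)
    mu_b, std_b = channel_stats(chan_b)
    return [(v - mu_a) / std_a * std_b + mu_b for v in chan_a]

a = [0.0, 1.0, 2.0, 3.0]
b = [10.0, 10.0, 20.0, 20.0]
styled = cross_norm(a, b)
# The restyled channel keeps A's content pattern but carries
# B's mean (15.0) and spread.
print(round(sum(styled) / len(styled), 4))  # → 15.0
```

Swapping statistics in both directions between two samples yields two augmented feature maps, which is what diversifies the content/style mixtures seen during training.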
Vision-Dialog Navigation (VDN) requires an agent to ask questions and navigate following the human responses to find target objects.
Few-shot crosslingual transfer has been shown to outperform its zero-shot counterpart with pretrained encoders like multilingual BERT.
In the world of action recognition research, one primary focus has been on how to construct and train networks to model the spatial-temporal volume of an input video.
The global existence of solutions to incompressible viscoelastic flows has been a longstanding open problem, even for global weak solutions.
Analysis of PDEs 76A10, 76D03, 35B65
Video action recognition is one of the representative tasks for video understanding.
Land-cover classification using remote sensing imagery is an important Earth observation task.
The vulnerability of deep neural networks (DNNs) to adversarial attack, which is an attack that can mislead state-of-the-art classifiers into making an incorrect classification with high confidence by deliberately perturbing the original inputs, raises concerns about the robustness of DNNs to such attacks.
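A classic attack of this kind is the fast gradient sign method (FGSM), which moves the input a small step in the sign direction of the loss gradient. This is a minimal hand-computed sketch on a fixed logistic classifier (the model, weights, and numbers are illustrative, not from any paper above):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fgsm_perturb(x, w, y, eps):
    """FGSM-style perturbation for logistic regression with
    cross-entropy loss: d(loss)/dx = (sigmoid(w.x) - y) * w,
    and the input moves by eps in the sign of that gradient."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    grad = [(sigmoid(z) - y) * wi for wi in w]
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

w = [2.0, -1.0]   # fixed classifier weights
x = [0.5, -0.5]   # correctly scored as positive: w.x = 1.5
x_adv = fgsm_perturb(x, w, y=1, eps=0.9)
score = sum(wi * xi for wi, xi in zip(w, x_adv))
print(score < 0)  # → True: the perturbed input is now scored negative
```

For deep networks the gradient is obtained by backpropagation through the whole model, and eps is kept small enough that the perturbation is imperceptible.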
Lexical simplification (LS) aims to replace complex words in a given sentence with their simpler alternatives of equivalent meaning, to simplify the sentence.
The onset of hydrodynamic instabilities is of great importance in both industry and daily life, due to the dramatic mechanical and thermodynamic changes for different types of flow motions.
Due to the intrinsic complexity and nonlinearity of chemical reactions, direct applications of traditional machine learning algorithms may face many difficulties.
Finally, we demonstrate that adversarial training with SAGE augmented data can improve performance and robustness of TableQA systems.
To derive the hidden dynamics from observed data is one of the fundamental but also challenging problems in many different fields.
In the case of semantic segmentation, this means that large amounts of pixelwise annotations are required to learn accurate models.
It is well known that featuremap attention and multi-path representation are important for visual recognition.
Ranked #5 on Instance Segmentation on COCO test-dev (APS metric)
Most previous work on adversarial attacks focuses mainly on image models, while the vulnerability of video models is less explored.
Benefiting from the collaborative learning of the L-mem and the V-mem, our CMN is able to exploit the memory of historical navigation decisions for the current step.
Inspired by this, we investigate methods to inform or guide deep learning models for geospatial image analysis to increase their performance when a limited amount of training data is available or when they are applied to scenarios other than which they were trained on.
In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to take advantage of the additional training signals derived from the semantic information.
Ranked #7 on Vision and Language Navigation on VLN Challenge
In this paper, we study a simple algorithm to construct asymptotically valid confidence regions for model parameters using the batch means method.
Despite an ever growing literature on reinforcement learning algorithms and applications, much less is known about their statistical inference.
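The batch means construction mentioned above can be sketched simply: split a long run of correlated iterates into contiguous batches, and use the sample variance of the batch means in a t-type interval for the mean. This is an illustrative sketch, not the paper's algorithm; the critical value 1.96 approximates the normal/t quantile when there are many batches.

```python
import math
import random

def batch_means_ci(samples, num_batches, crit=1.96):
    """Batch-means confidence interval for the mean of a correlated
    sequence: average within batches, then treat the batch means as
    approximately independent."""
    batch_size = len(samples) // num_batches
    means = [sum(samples[i * batch_size:(i + 1) * batch_size]) / batch_size
             for i in range(num_batches)]
    grand = sum(means) / num_batches
    var = sum((m - grand) ** 2 for m in means) / (num_batches - 1)
    half = crit * math.sqrt(var / num_batches)
    return grand - half, grand + half

# An AR(1)-style correlated sequence around 0.
random.seed(0)
x, seq = 0.0, []
for _ in range(10000):
    x = 0.5 * x + random.uniform(-0.5, 0.5)
    seq.append(x)

lo, hi = batch_means_ci(seq, num_batches=20)
print(lo < sum(seq) / len(seq) < hi)  # → True
```

Batching is what absorbs the serial correlation: individual iterates are dependent, but means over long enough batches behave approximately like i.i.d. draws.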
Recent work has validated the importance of subword information for word representation learning.
Domain adaptation aims to exploit the knowledge in a source domain to promote learning tasks in a target domain, and plays a critical role in real-world applications.
Motivated by our observation that motion information is the key to good anomaly detection performance in video, we propose a temporal augmented network to learn a motion-aware feature.
Lexical simplification (LS) aims to replace complex words in a given sentence with their simpler alternatives of equivalent meaning.
4 code implementations • 9 Jul 2019 • Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, Yi Zhu
We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating).
However, learning the full extent of pixel-level instance response in a weakly supervised manner remains unexplored.
Ranked #8 on Image-level Supervised Instance Segmentation on PASCAL VOC 2012 val (using extra training data)
While neural dependency parsers provide state-of-the-art accuracy for several languages, they still rely on large amounts of costly labeled training data.
The use of subword-level information (e.g., characters, character n-grams, morphemes) has become ubiquitous in modern word representation learning.
This paper develops a deep-learning framework to synthesize a ground-level view of a location given an overhead image.
In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks.
Ranked #1 on Semantic Segmentation on CamVid (using extra training data)
More significantly, we show the generated images are representative of the locations and that the representations learned by the cGANs are informative.
Despite the significant progress that has been made on estimating optical flow recently, most estimation methods, including classical and deep learning approaches, still have difficulty with multi-scale estimation, real-time computation, and/or occlusion reasoning.
Motivated by this, we first design a process to stimulate peaks to emerge from a class response map.
Ranked #9 on Image-level Supervised Instance Segmentation on PASCAL VOC 2012 val (using extra training data)
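Peaks in a class response map are typically taken to be local maxima. This is a minimal sketch of such peak finding on a 2-D map (illustrative only; the paper's stimulation procedure operates inside the network during training):

```python
def find_peaks(response, threshold=0.0):
    """Local maxima (3x3 neighborhood) of a 2-D class response map,
    returned as (row, col) positions scoring above `threshold`."""
    h, w = len(response), len(response[0])
    peaks = []
    for r in range(h):
        for c in range(w):
            v = response[r][c]
            if v <= threshold:
                continue
            neighbors = [response[rr][cc]
                         for rr in range(max(0, r - 1), min(h, r + 2))
                         for cc in range(max(0, c - 1), min(w, c + 2))
                         if (rr, cc) != (r, c)]
            if all(v > n for n in neighbors):
                peaks.append((r, c))
    return peaks

# A toy class response map with two strong local responses.
cam = [[0.1, 0.2, 0.1, 0.0],
       [0.2, 0.9, 0.2, 0.1],
       [0.1, 0.2, 0.1, 0.7],
       [0.0, 0.1, 0.3, 0.2]]
print(find_peaks(cam, threshold=0.5))  # → [(1, 1), (2, 3)]
```

Each detected peak then serves as a candidate object location from which a pixel-level response can be traced back.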
Unseen Action Recognition (UAR) aims to recognise novel action categories without training examples.
Ranked #2 on Action Recognition on ActivityNet
Weakly supervised object localization remains challenging, where only image labels instead of bounding boxes are available during training.
Ranked #2 on Weakly Supervised Object Detection on COCO
State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs.
Ranked #16 on Action Recognition on UCF101
We employ a multi-task learning framework that performs the three highly related steps of action proposal, action recognition, and action localization refinement in parallel, instead of the standard sequential pipeline.
This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves.