Existing deepfake speech detection systems lack generalizability to unseen attacks (i.e., samples generated by generative algorithms not seen during training).
This suggests that the random masking strategy that is inherited from the image MAE is less effective for video MAE.
To address this issue, we propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation that enhances a model's ability to reorganize patches mixed across images, exploring both local visual relevance and global semantic coherence.
After feeding the input sentence into the encoder of paraphrase modeling, we generate the substitutes based on a novel decoding strategy that concentrates solely on the lexical variations of the complex word.
We conduct empirical studies on two datasets: N-body MNIST, a synthetic dataset with chaotic behavior, and SEVIR, a real-world precipitation nowcasting dataset.
In this paper, we propose a deep learning framework for solving high-dimensional partial integro-differential equations (PIDEs) based on temporal difference learning.
This paper investigates the challenges of applying vision-language models (VLMs) to zero-shot visual recognition tasks in an open-world setting, with a focus on contrastive vision-language models such as CLIP.
Clickbait, which lures users with surprising or even thrilling headlines to increase click-through rates, permeates almost all online content publishers, such as news portals and social media.
The inference of the DL model is performed on a low-power microcontroller in the central node.
It has been commonly observed that a teacher model with superior performance does not necessarily result in a stronger student, highlighting a discrepancy between current teacher training practices and effective knowledge transfer.
1 code implementation • 26 Apr 2023 • Bingqian Lin, Zicong Chen, Mingjie Li, Haokun Lin, Hang Xu, Yi Zhu, Jianzhuang Liu, Wenjia Cai, Lei Yang, Shen Zhao, Chenfei Wu, Ling Chen, Xiaojun Chang, Yi Yang, Lei Xing, Xiaodan Liang
In MOTOR, we combine two kinds of basic medical knowledge, i.e., general and specific knowledge, in a complementary manner to boost the general pretraining process.
This work proposes POMP, a prompt pre-training method for vision-language models.
Ranked #1 on Open Vocabulary Semantic Segmentation on PascalVOC-20 (hIoU metric)
With advances in deep learning, voice-based applications are burgeoning, ranging from personal assistants and affective computing to remote disease diagnostics.
Vision-Language Navigation (VLN) is a challenging task which requires an agent to align complex visual observations to language instructions to reach the goal position.
Geospatial technologies are becoming increasingly essential in our world for a wide range of applications, including agriculture, urban planning, and disaster response.
Instead of purely relying on the alignment from the noisy data, this paper proposes a novel loss function termed SimCon, which accounts for intra-modal similarities to determine the appropriate set of positive samples to align.
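As a rough illustration of the idea, not the paper's exact formulation, a cross-modal contrastive loss can use intra-modal image similarity to admit extra positive pairs. The function name, the thresholding rule, and the soft cross-entropy form below are all assumptions for the sketch:

```python
import numpy as np

def simcon_loss(img, txt, tau=0.1, sim_thresh=0.9):
    """Sketch of a SimCon-style loss: image-text contrastive learning where
    intra-modal (image-image) similarity selects additional positive pairs.
    Illustrative only; not the paper's exact formulation."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / tau            # cross-modal similarity logits
    intra = img @ img.T                   # intra-modal similarities
    pos_mask = intra >= sim_thresh        # i's positives: captions of similar images
    # soft cross-entropy against the (possibly multi-)positive mask
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss = -(pos_mask * log_prob).sum(axis=1) / pos_mask.sum(axis=1)
    return loss.mean()
```

The diagonal of `intra` is always 1, so each image keeps its own caption as a positive, and near-duplicate images contribute extra positives instead of being treated as false negatives.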
Recent vision transformer based video models mostly follow the "image pre-training then fine-tuning" paradigm and have achieved great success on multiple video benchmarks.
Ranked #1 on Action Recognition on Diving-48 (using extra training data)
Specifically, we first propose text-to-views consistency modeling to learn correspondence for multiple views of the same input image.
SuperScaler is a system that facilitates the design and generation of highly flexible parallelization plans.
The transformer architecture, which has recently seen booming applications in vision tasks, represents a departure from the widespread convolutional paradigm.
Pre-trained large language models can efficiently interpolate human-written prompts in a natural way.
Multimodal image-text models have shown remarkable performance in the past few years.
Then in the Sentence-Mask Alignment (SMA) module, the masks are weighted by the sentence embedding to localize the referred object, and finally projected back to aggregate the pixels for the target.
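The weighting-and-aggregation step described above can be sketched as follows; the function name, the (K, H, W) mask layout, and the softmax scoring are illustrative assumptions, not the paper's code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sentence_mask_alignment(mask_feats, masks, sent_emb):
    """Hypothetical sketch of a sentence-mask alignment step: each candidate
    mask is scored against the sentence embedding, and the normalized scores
    weight the per-mask pixel maps into one localization map."""
    scores = mask_feats @ sent_emb        # (K,) mask-sentence similarity
    weights = softmax(scores)             # normalize to attention weights
    # weighted sum over K soft masks of shape (H, W)
    return np.tensordot(weights, masks, axes=1)
```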
In a validation using a public dataset, the prototype developed achieved a FoG detection sensitivity of 88.8% and an F1 score of 85.34%, using less than 20k trainable parameters per sensor node.
Inspired by the success of vision-language models (VLMs) in zero-shot classification, recent works attempt to extend this line of work into object detection by leveraging the localization ability of pre-trained VLMs and generating pseudo labels for unseen classes in a self-training manner.
First, DePT plugs visual prompts into the vision Transformer and only tunes these source-initialized prompts during adaptation.
Ranked #2 on Domain Adaptation on VisDA2017
Taking collocations of Gaussian functions as the test functions in the weak form of the FP equation, we transfer the derivatives to the Gaussian functions and thus approximate the weak form by an expectation estimated as a sample average over the data.
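Concretely, for a Fokker-Planck equation with drift $\mu$ and diffusion $D$ (notation assumed here, not taken from the listing), integration by parts transfers the derivatives onto a test function $\phi$:

```latex
\int \phi\,\partial_t p\,dx
  = \int \Big[\mu\cdot\nabla\phi + \tfrac{1}{2}\,D:\nabla^2\phi\Big]\,p\,dx
  = \mathbb{E}_{x\sim p}\!\Big[\mu(x)\cdot\nabla\phi(x)
      + \tfrac{1}{2}\,D(x):\nabla^2\phi(x)\Big],
```

and the final expectation is replaced by a sample average over the observed data, with $\phi$ drawn from a collocation family of Gaussians.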
With the explosive growth of the spatiotemporal Earth observation data in the past decade, data-driven models that apply Deep Learning (DL) are demonstrating impressive potential for various Earth system forecasting tasks.
Ranked #1 on Earth Surface Forecasting on EarthNet2021 OOD Track
In this paper, we show that recent advances in self-supervised feature learning enable unsupervised object discovery and semantic segmentation with a performance that matches the state of the field on supervised semantic segmentation 10 years ago.
While self-supervised learning has enabled effective representation learning in the absence of labels, for vision, video remains a relatively untapped source of supervision.
However, in real-world applications, it is common for the training sets to have long-tailed distributions.
Vision-Language Navigation (VLN) is a challenging task that requires an embodied agent to perform action-level modality alignment, i.e., make instruction-asked actions sequentially in complex visual environments.
2 code implementations • 11 May 2022 • Yawei Li, Kai Zhang, Radu Timofte, Luc van Gool, Fangyuan Kong, Mingxi Li, Songwei Liu, Zongcai Du, Ding Liu, Chenhui Zhou, Jingyi Chen, Qingrui Han, Zheyuan Li, Yingqi Liu, Xiangyu Chen, Haoming Cai, Yu Qiao, Chao Dong, Long Sun, Jinshan Pan, Yi Zhu, Zhikai Zong, Xiaoxiao Liu, Zheng Hui, Tao Yang, Peiran Ren, Xuansong Xie, Xian-Sheng Hua, Yanbo Wang, Xiaozhong Ji, Chuming Lin, Donghao Luo, Ying Tai, Chengjie Wang, Zhizhong Zhang, Yuan Xie, Shen Cheng, Ziwei Luo, Lei Yu, Zhihong Wen, Qi Wu1, Youwei Li, Haoqiang Fan, Jian Sun, Shuaicheng Liu, Yuanfei Huang, Meiguang Jin, Hua Huang, Jing Liu, Xinjian Zhang, Yan Wang, Lingshun Long, Gen Li, Yuanfan Zhang, Zuowei Cao, Lei Sun, Panaetov Alexander, Yucong Wang, Minjie Cai, Li Wang, Lu Tian, Zheyuan Wang, Hongbing Ma, Jie Liu, Chao Chen, Yidong Cai, Jie Tang, Gangshan Wu, Weiran Wang, Shirui Huang, Honglei Lu, Huan Liu, Keyan Wang, Jun Chen, Shi Chen, Yuchun Miao, Zimo Huang, Lefei Zhang, Mustafa Ayazoğlu, Wei Xiong, Chengyi Xiong, Fei Wang, Hao Li, Ruimian Wen, Zhijing Yang, Wenbin Zou, Weixin Zheng, Tian Ye, Yuncheng Zhang, Xiangzhen Kong, Aditya Arora, Syed Waqas Zamir, Salman Khan, Munawar Hayat, Fahad Shahbaz Khan, Dandan Gaoand Dengwen Zhouand Qian Ning, Jingzhu Tang, Han Huang, YuFei Wang, Zhangheng Peng, Haobo Li, Wenxue Guan, Shenghua Gong, Xin Li, Jun Liu, Wanjun Wang, Dengwen Zhou, Kun Zeng, Hanjiang Lin, Xinyu Chen, Jinsheng Fang
The aim was to design a network for single image super-resolution that improved efficiency as measured by several metrics, including runtime, parameters, FLOPs, activations, and memory consumption, while at least maintaining a PSNR of 29.00 dB on the DIV2K validation set.
Turn-taking, aiming to decide when the next speaker can start talking, is an essential component in building human-robot spoken dialogue systems.
Idioms are a kind of idiomatic expression in Chinese, most of which consist of four Chinese characters.
This work harnesses interpretable machine learning methods to address the challenging inverse design problem of origami-inspired systems.
Conventional 3D object detection approaches concentrate on bounding box representation learning with several parameters, i.e., localization, dimension, and orientation.
Multiple datasets and open challenges for object detection have been introduced in recent years.
Ranked #1 on Object Detection on BigDetection val
Building Spoken Language Understanding (SLU) robust to Automatic Speech Recognition (ASR) errors is an essential issue for various voice-enabled virtual assistants.
In short texts, the extremely short length, feature sparsity, and high ambiguity pose huge challenges to classification tasks.
With this framework as a tool, we propose a correlative covariation projection (CCP) method by using an explicit nonlinear mapping.
The vision-language navigation (VLN) task requires an agent to reach a target with the guidance of natural language instruction.
Contrastive learning allows us to flexibly define powerful losses by contrasting positive pairs from sets of negative samples.
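A minimal instance of this "positive pair versus a set of negatives" pattern is the InfoNCE loss; the sketch below is a generic illustration, with the function name and temperature chosen here, not tied to any paper in this listing:

```python
import numpy as np

def info_nce(anchor, positive, negatives, tau=0.07):
    """Minimal InfoNCE-style contrastive loss: pull one positive toward the
    anchor while pushing a set of negatives away."""
    def norm(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = norm(anchor), norm(positive), norm(negatives)
    pos = (a * p).sum() / tau                  # similarity to the positive
    negs = n @ a / tau                         # similarities to each negative
    logits = np.concatenate([[pos], negs])
    return -pos + np.log(np.exp(logits).sum())  # -log softmax of the positive
```

The loss is zero only when the positive dominates all negatives, which is exactly the contrast the sentence above describes.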
Parallel data for sentence simplification (SS) is scarce for neural SS modeling.
Recognizing and localizing objects in the 3D space is a crucial ability for an AI agent to perceive its surrounding environment.
Our formulation is able to capture global context in a video and is thus robust to temporal content changes.
Semantic segmentation is a challenging problem due to difficulties in modeling context in complex scenes and class confusions along boundaries.
Specifically, we propose a Dynamic Reinforced Instruction Attacker (DR-Attacker), which learns to mislead the navigator to move to the wrong target by destroying the most instructive information in instructions at different timesteps.
We introduce a system for optimal resource allocation that can predict performance with aggressive trade-offs, for both new and past observed queries.
A navigation agent is supposed to have various intelligent skills, such as visual perception, mapping, planning, exploration, and reasoning.
Neural network-based semantic segmentation has achieved remarkable results when large amounts of annotated data are available, that is, in the supervised case.
Previous works study the adversarial robustness of image classifiers at the image level and use all pixel information in an image indiscriminately, without exploring regions with different semantic meanings in the pixel space of an image.
To better exploit the intrinsic structure of the target domain, we propose Domain Consensus Clustering (DCC), which exploits the domain consensus knowledge to discover discriminative clusters on both common samples and private ones.
Ranked #3 on Partial Domain Adaptation on Office-31
We first introduce the vanilla video transformer and show that the transformer module is able to perform spatio-temporal modeling from raw pixels, but with heavy memory usage.
Ranked #14 on Action Classification on Charades
In this task, an agent is required to navigate from an arbitrary position in a 3D embodied environment to localize a target following a scene description.
Numerical simulations first illustrate the consistency of theoretical results on the sharp interface limit.
Numerical Analysis (MSC: 76Z99, 92B05, 76R50)
Weyl points are degenerate points on the spectral bands at which energy bands intersect conically.
Mathematical Physics; Spectral Theory
Semi-supervised learning through deep generative models and multi-lingual pretraining techniques have achieved tremendous success across different areas of NLP.
CrossNorm exchanges styles between feature channels to perform style augmentation, diversifying the content and style mixtures.
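A CrossNorm-style exchange can be sketched as swapping per-channel statistics between two feature maps; the (C, H, W) layout and function name below are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def crossnorm(x, y, eps=1e-5):
    """Sketch of a CrossNorm-style style exchange: normalize each (C, H, W)
    feature map per channel, then re-scale with the other map's channel
    statistics, so each output carries one map's content in the other's style."""
    def stats(f):
        return f.mean(axis=(1, 2), keepdims=True), f.std(axis=(1, 2), keepdims=True) + eps
    mx, sx = stats(x)
    my, sy = stats(y)
    x_in_y_style = (x - mx) / sx * sy + my    # x's content, y's channel statistics
    y_in_x_style = (y - my) / sy * sx + mx
    return x_in_y_style, y_in_x_style
```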
Vision-Dialog Navigation (VDN) requires an agent to ask questions and navigate following the human responses to find target objects.
Few-shot crosslingual transfer has been shown to outperform its zero-shot counterpart with pretrained encoders like multilingual BERT.
The global existence of solutions to incompressible viscoelastic flows has been a longstanding open problem, even for global weak solutions.
Analysis of PDEs (MSC: 76A10, 76D03, 35B65)
In the world of action recognition research, one primary focus has been on how to construct and train networks to model the spatial-temporal volume of an input video.
Video action recognition is one of the representative tasks for video understanding.
Land-cover classification using remote sensing imagery is an important Earth observation task.
Deep neural networks (DNNs) are vulnerable to adversarial attacks, which deliberately perturb the original inputs to mislead state-of-the-art classifiers into incorrect classifications with high confidence; this raises concerns about the robustness of DNNs to such attacks.
Lexical simplification (LS) aims to replace complex words in a given sentence with their simpler alternatives of equivalent meaning, to simplify the sentence.
The onset of hydrodynamic instabilities is of great importance in both industry and daily life, due to the dramatic mechanical and thermodynamic changes for different types of flow motions.
Due to the intrinsic complexity and nonlinearity of chemical reactions, direct applications of traditional machine learning algorithms may face many difficulties.
Finally, we demonstrate that adversarial training with SAGE augmented data can improve performance and robustness of TableQA systems.
To derive the hidden dynamics from observed data is one of the fundamental but also challenging problems in many different fields.
In the case of semantic segmentation, this means that large amounts of pixelwise annotations are required to learn accurate models.
It is well known that feature-map attention and multi-path representation are important for visual recognition.
Ranked #8 on Instance Segmentation on COCO test-dev (APM metric)
Most previous work on adversarial attacks focuses on image models, while the vulnerability of video models is less explored.
Benefiting from the collaborative learning of the L-mem and the V-mem, our CMN is able to exploit the memory of historical navigation decisions when making the decision at the current step.
Inspired by this, we investigate methods to inform or guide deep learning models for geospatial image analysis to increase their performance when a limited amount of training data is available or when they are applied to scenarios other than which they were trained on.
In this paper, we introduce Auxiliary Reasoning Navigation (AuxRN), a framework with four self-supervised auxiliary reasoning tasks to take advantage of the additional training signals derived from the semantic information.
Ranked #13 on Vision and Language Navigation on VLN Challenge
In this paper, we study a simple algorithm to construct asymptotically valid confidence regions for model parameters using the batch means method.
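The batch means construction itself is standard: split a correlated run into batches, and let the variability of the batch means calibrate the interval. A minimal sketch, using a normal quantile for brevity rather than the t quantile, and not the paper's exact procedure:

```python
import numpy as np
from statistics import NormalDist

def batch_means_ci(samples, n_batches=10, level=0.95):
    """Batch-means confidence interval for the long-run mean of a correlated
    sequence: average each batch and treat the batch means as roughly i.i.d."""
    samples = np.asarray(samples)
    m = len(samples) // n_batches
    means = samples[: m * n_batches].reshape(n_batches, m).mean(axis=1)
    center = means.mean()
    se = means.std(ddof=1) / np.sqrt(n_batches)
    z = NormalDist().inv_cdf(0.5 + level / 2)   # normal quantile, an approximation
    return center - z * se, center + z * se
```

Because each batch mean averages over many correlated samples, the batch means are far less correlated than the raw sequence, which is what makes the i.i.d. approximation usable.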
We investigate statistical uncertainty quantification for reinforcement learning (RL) and its implications in exploration policy.
Recent work has validated the importance of subword information for word representation learning.
Domain adaptation aims to exploit the knowledge in the source domain to promote learning tasks in the target domain, and plays a critical role in real-world applications.
Motivated by our observation that motion information is the key to good anomaly detection performance in video, we propose a temporal augmented network to learn a motion-aware feature.
Lexical simplification (LS) aims to replace complex words in a given sentence with their simpler alternatives of equivalent meaning.
4 code implementations • 9 Jul 2019 • Jian Guo, He He, Tong He, Leonard Lausen, Mu Li, Haibin Lin, Xingjian Shi, Chenguang Wang, Junyuan Xie, Sheng Zha, Aston Zhang, Hang Zhang, Zhi Zhang, Zhongyue Zhang, Shuai Zheng, Yi Zhu
We present GluonCV and GluonNLP, the deep learning toolkits for computer vision and natural language processing based on Apache MXNet (incubating).
However, learning the full extent of pixel-level instance response in a weakly supervised manner remains unexplored.
Ranked #10 on Image-level Supervised Instance Segmentation on PASCAL VOC 2012 val (using extra training data)
While neural dependency parsers provide state-of-the-art accuracy for several languages, they still rely on large amounts of costly labeled training data.
The use of subword-level information (e.g., characters, character n-grams, morphemes) has become ubiquitous in modern word representation learning.
This paper develops a deep-learning framework to synthesize a ground-level view of a location given an overhead image.
In this paper, we present a video prediction-based methodology to scale up training sets by synthesizing new training samples in order to improve the accuracy of semantic segmentation networks.
Ranked #1 on Semantic Segmentation on KITTI Semantic Segmentation (using extra training data)
More significantly, we show the generated images are representative of the locations and that the representations learned by the cGANs are informative.
Despite the significant progress that has been made on estimating optical flow recently, most estimation methods, including classical and deep learning approaches, still have difficulty with multi-scale estimation, real-time computation, and/or occlusion reasoning.
Nonetheless, using the new treebank, we build a pipeline system to parse raw tweets into UD.
Ranked #2 on Dependency Parsing on Tweebank
Motivated by this, we first design a process to stimulate peaks to emerge from a class response map.
Ranked #11 on Image-level Supervised Instance Segmentation on PASCAL VOC 2012 val (using extra training data)
Unseen Action Recognition (UAR) aims to recognise novel action categories without training examples.
Ranked #13 on Action Recognition on ActivityNet
Weakly supervised object localization remains challenging, where only image labels instead of bounding boxes are available during training.
Ranked #2 on Weakly Supervised Object Detection on COCO
This notebook paper describes our system for the untrimmed classification task in the ActivityNet challenge 2016.
State-of-the-art action recognition approaches rely on traditional optical flow estimation methods to pre-compute motion information for CNNs.
Ranked #18 on Action Recognition on UCF101
We investigate the problem of representing an entire video using CNN features for human action recognition.
We employ a multi-task learning framework that performs the three highly related steps of action proposal, action recognition, and action localization refinement in parallel instead of the standard sequential pipeline that performs the steps in order.
This paper performs the first investigation into depth for large-scale human action recognition in video where the depth cues are estimated from the videos themselves.