Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank.
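As a rough illustration of the matching step, here is a minimal sketch of attention-based readout from a memory bank, in the spirit of space-time memory matching; the function name, shapes, and scaling are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def memory_readout(query_key, memory_key, memory_value):
    """Spatio-temporal matching: read from a memory bank via attention.

    query_key:    (B, C, H, W)    key features of the current frame
    memory_key:   (B, C, T*H*W)   keys of the memorized frames
    memory_value: (B, D, T*H*W)   values of the memorized frames
    """
    B, C, H, W = query_key.shape
    q = query_key.flatten(2)                                  # (B, C, HW)
    affinity = torch.einsum('bck,bcm->bkm', q, memory_key)    # (B, HW, THW)
    affinity = F.softmax(affinity / C ** 0.5, dim=-1)         # match scores
    out = torch.einsum('bkm,bdm->bdk', affinity, memory_value)
    return out.view(B, -1, H, W)                              # (B, D, H, W)
```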
In this paper, we present TridentSE, a novel architecture for speech enhancement, which is capable of efficiently capturing both global information and local details.
This paper presents OmniVL, a new foundation model to support both image-language and video-language tasks using one universal architecture.
Continuous Speech Keyword Spotting (CSKWS) is the task of detecting predefined keywords in continuous speech.
In the proposed paradigm, global and local factors in speech are explicitly decomposed and separately manipulated to achieve high speaker similarity and continuous prosody.
We propose to incorporate peripheral position encoding into the multi-head self-attention layers, letting the network learn from the training data to partition the visual field into diverse peripheral regions.
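As a hedged sketch of how such a positional term can enter self-attention, the layer below adds a learnable relative-position bias to the attention logits; the class name PeripheralAttention and the bias parameterization are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class PeripheralAttention(nn.Module):
    """Self-attention with a learnable relative-position bias on the logits,
    so the network can learn its own partitioning of the visual field."""
    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # one bias per head and per relative offset on a window x window grid
        self.bias = nn.Parameter(torch.zeros(num_heads, (2 * window - 1) ** 2))
        coords = torch.stack(torch.meshgrid(
            torch.arange(window), torch.arange(window), indexing='ij'))
        rel = coords.flatten(1)[:, :, None] - coords.flatten(1)[:, None, :]
        rel = (rel[0] + window - 1) * (2 * window - 1) + (rel[1] + window - 1)
        self.register_buffer('rel_index', rel)                # (N, N)

    def forward(self, x):                                     # x: (B, N, C)
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, -1)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn + self.bias[:, self.rel_index]            # positional bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```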
Finally, a novel link attention module serves as the decoder, reconstructing data from the decomposed content and style with the help of the linking keys.
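One plausible reading of such a decoder is cross-attention in which content features form the queries, the linking keys form the keys, and style features form the values; the sketch below follows that assumption and is not the paper's actual module.

```python
import torch.nn as nn

class LinkAttentionDecoder(nn.Module):
    """Cross-attention decoder: content queries attend over linking keys to
    gather style values (an illustrative reading of 'link attention')."""
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.out = nn.Linear(dim, dim)

    def forward(self, content, linking_keys, style):
        # content: (B, N, D), linking_keys / style: (B, M, D)
        mixed, _ = self.attn(query=content, key=linking_keys, value=style)
        return self.out(mixed + content)      # residual reconstruction
```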
It can even be replaced by a zero-parameter operation.
Specifically, we replace the MLP module in the token-mixing step with a novel sparse MLP (sMLP) module.
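For concreteness, here is a minimal sketch of an axis-wise sparse token-mixing module: tokens are mixed only along rows and only along columns, and the two directions plus an identity path are fused by a 1x1 convolution. Module and parameter names are assumptions; the original sMLP may differ in detail.

```python
import torch
import torch.nn as nn

class SparseMLP(nn.Module):
    """Token mixing with axis-wise linear layers instead of a full token MLP."""
    def __init__(self, channels, h, w):
        super().__init__()
        self.mix_h = nn.Linear(h, h)   # mixes tokens along the height axis
        self.mix_w = nn.Linear(w, w)   # mixes tokens along the width axis
        self.fuse = nn.Conv2d(channels * 3, channels, kernel_size=1)

    def forward(self, x):              # x: (B, C, H, W)
        xh = self.mix_h(x.transpose(2, 3)).transpose(2, 3)   # mix over H
        xw = self.mix_w(x)                                   # mix over W
        return self.fuse(torch.cat([x, xh, xw], dim=1))
```

Each token thus interacts with only 2N-1 others on an NxN grid rather than all N^2, which is what makes the mixing sparse.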
Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript.
Convolutional neural networks (CNNs) are the dominant deep neural network (DNN) architecture for computer vision.
In this paper, we propose a novel contrastive mask prediction (CMP) task for visual representation learning and design a mask contrast (MaskCo) framework to implement the idea.
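A minimal sketch of the contrastive objective such a framework could use: the feature predicted for a masked region is pulled toward the true feature of that region and pushed away from distractors via an InfoNCE loss. Shapes, names, and the temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mask_contrast_loss(pred, pos, negs, tau=0.07):
    """InfoNCE over masked-region features.

    pred: (B, D)    features predicted for the masked regions
    pos:  (B, D)    target features of the same regions
    negs: (B, K, D) distractor features from other locations
    """
    pred = F.normalize(pred, dim=-1)
    pos = F.normalize(pos, dim=-1)
    negs = F.normalize(negs, dim=-1)
    l_pos = (pred * pos).sum(-1, keepdim=True)        # (B, 1)
    l_neg = torch.einsum('bd,bkd->bk', pred, negs)    # (B, K)
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = logits.new_zeros(len(logits), dtype=torch.long)  # positive at index 0
    return F.cross_entropy(logits, labels)
```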
The proxy task is to estimate the position and size of the image patch in a sequence of video frames, given only the target bounding box in the first frame.
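Under that framing, the training signal reduces to box regression: predict the patch's centre and size in each later frame and penalize the deviation. A minimal sketch, assuming boxes parameterized as normalized (cx, cy, w, h):

```python
import torch.nn.functional as F

def proxy_task_loss(pred_boxes, true_boxes):
    """pred_boxes, true_boxes: (B, T, 4) as normalized (cx, cy, w, h)
    for T frames following the annotated first frame."""
    return F.smooth_l1_loss(pred_boxes, true_boxes)
```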
This paper presents a self-supervised learning framework, named MGF, for general-purpose speech representation learning.
We believe that VAE$^2$ is also applicable to other stochastic sequence prediction problems where the training data lack stochasticity.
Based on the probability space, we further derive new fusion strategies that achieve state-of-the-art performance on four well-known action recognition datasets.
Accordingly, a hybrid network representation is presented that enables us to leverage Variational Dropout, so that the approximation of the posterior distribution becomes fully gradient-based and highly efficient.
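For reference, here is a generic sketch of Variational Dropout with the local reparameterization trick, where a learnable log-variance (log_alpha) governs the multiplicative noise so the posterior approximation is trained purely by gradients; this is the standard formulation, not necessarily the paper's exact variant.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VariationalDropoutLinear(nn.Module):
    """Linear layer with multiplicative Gaussian noise on the weights,
    sampled at the pre-activations (local reparameterization)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.log_alpha = nn.Parameter(torch.full((out_features, in_features), -3.0))

    def forward(self, x):
        if self.training:
            mean = F.linear(x, self.weight)
            var = F.linear(x ** 2, torch.exp(self.log_alpha) * self.weight ** 2)
            return mean + torch.sqrt(var + 1e-8) * torch.randn_like(mean)
        return F.linear(x, self.weight)
```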
The two stages are connected in series, as the input proposals of the FM stage are generated by the CM stage.
no code implementations • 2 Dec 2018 • Stephen H. Bach, Daniel Rodriguez, Yintao Liu, Chong Luo, Haidong Shao, Cassandra Xia, Souvik Sen, Alexander Ratner, Braden Hancock, Houman Alborzi, Rahul Kuchhal, Christopher Ré, Rob Malkin
Labeling training data is one of the most costly bottlenecks in developing machine learning-based applications.
Recently, Siamese-network-based trackers have attracted tremendous interest for their fast tracking speed and high performance.