TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked objects as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs.
Ranked #1 on Multi-Object Tracking on MOT16 (using extra training data)
This paper investigates the problem of impact-time-control and proposes a learning-based computational guidance algorithm to solve this problem.
1 code implementation • 22 Jul 2020 • Brooke E. Husic, Nicholas E. Charron, Dominik Lemm, Jiang Wang, Adrià Pérez, Maciej Majewski, Andreas Krämer, Yaoyi Chen, Simon Olsson, Gianni de Fabritiis, Frank Noé, Cecilia Clementi
5, 755 (2019)] demonstrated that the existence of such a variational limit enables the use of a supervised machine learning framework to generate a coarse-grained force field, which can then be used for simulation in the coarse-grained space.
Using ensemble learning and stratified sampling, we propose a 2-layer training scheme that enables GDML to learn an effective coarse-grained model.
In this paper, we propose the resource constrained differentiable architecture search (RC-DARTS) method to learn architectures that are significantly smaller and faster while achieving comparable accuracy.
We show that AdvProp improves a wide range of models on various image recognition tasks and performs better when the models are bigger.
Ranked #61 on Image Classification on ImageNet
We show that CGnets can capture all-atom explicit-solvent free energy surfaces with models using only a few coarse-grained beads and no solvent, while classical coarse-graining methods fail to capture crucial features of the free energy surface.
Comparing to standard Faster RCNN, it contains three highlights: an ensemble of two classification heads and a distillation head to avoid overfitting on noisy labels and improve the mining precision, masking the negative sample loss in box predictor to avoid the harm of false negative labels, and training box regression head only on seed annotations to eliminate the harm from inaccurate boundaries of mined bounding boxes.
This formulation shows the connection between the standard GAN training process and the primal-dual subgradient methods for convex optimization.
We demonstrate how to approximate kernels such as Gaussian RBF up to a given order using compact explicit feature maps in a parameter-free manner.
By designing a novel reward strategy, we are able to learn to locate regions that are spatially and semantically distinctive with reinforcement learning algorithm.
While deep convolutional neural networks (CNNs) have shown a great success in single-label image classification, it is important to note that real world images generally contain multiple labels, which could correspond to different objects, scenes, actions and attributes in an image.
Fine-grained recognition is challenging due to its subtle local inter-class differences versus large intra-class variations such as poses.
While feedforward deep convolutional neural networks (CNNs) have been a great success in computer vision, it is important to remember that the human visual contex contains generally more feedback connections than foward connections.
ABC-CNN determines an attention map for an image-question pair by convolving the image feature map with configurable convolutional kernels derived from the question's semantics.
We adapt a state-of-the-art semantic image segmentation model, which we jointly train with multi-scale input images and the attention model.
In particular, we propose a transposed weight sharing scheme, which not only improves performance on image captioning, but also makes the model more suitable for the novel concept learning task.
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel image captions.
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images.
We present a novel multiview spatio-temporal AND-OR graph (MST-AOG) representation for cross-view action recognition, i. e., the recognition is performed on the video from an unknown and unseen view.
This paper proposes a deep ranking model that employs deep learning techniques to learn similarity metric directly from images. It has higher learning capability than models based on hand-crafted features.
A robust tracking framework based on the locality sensitive histograms is proposed, which consists of two main components: a new feature for tracking that is robust to illumination changes and a novel multi-region tracking algorithm that runs in realtime even with hundreds of regions.