In PromptPose, we propose that adapting language knowledge to visual animal poses is key to achieving effective animal pose estimation.
To track the target in a video, current visual trackers usually adopt greedy search for target object localization in each frame, that is, the candidate region with the maximum response score will be selected as the tracking result of each frame.
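The greedy search described above can be sketched in a few lines. This is only an illustration: the `(x, y, width, height)` box format and the function name `greedy_track` are assumptions for the example, not taken from any particular tracker.

```python
from typing import List, Sequence, Tuple

# Hypothetical box format for this sketch: (x, y, width, height).
Box = Tuple[int, int, int, int]


def greedy_track(
    candidates_per_frame: Sequence[Sequence[Box]],
    scores_per_frame: Sequence[Sequence[float]],
) -> List[Box]:
    """For each frame independently, keep the candidate with the highest score."""
    track: List[Box] = []
    for candidates, scores in zip(candidates_per_frame, scores_per_frame):
        if not candidates or len(candidates) != len(scores):
            raise ValueError("each frame needs one score per candidate")
        # Greedy choice: index of the maximum response score in this frame.
        best = max(range(len(scores)), key=scores.__getitem__)
        track.append(candidates[best])
    return track
```

Because each frame is decided in isolation, a single wrong maximum cannot be corrected by later frames, which is the limitation that non-greedy strategies try to address.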
When fine-tuning on downstream tasks, a modality-specific adapter is used to introduce prior information about the data and tasks into the model, making it suitable for these tasks.
Ranked #1 on Semantic Segmentation on Cityscapes test (using extra training data)
Point cloud segmentation is fundamental in understanding 3D environments.
Ranked #4 on Semantic Segmentation on S3DIS Area5
However, methods based on this technique ignore the pressure on a single transformation matrix due to the complex information contained in the data.
We observe that the prevailing set abstraction design for down-sampling points may maintain too much unimportant background information that can affect feature learning for detecting objects.
Then, a glimpse-based decoder is introduced to provide refined detection results based on both the glimpse features and the attention modeling outputs of the previous stage.
Whereas adversarial training can be useful against specific adversarial perturbations, it has also proven ineffective in generalizing towards attacks deviating from those used for training.
Crucial for healthcare and biomedical applications, respiration monitoring often employs wearable sensors in practice, causing inconvenience due to their direct contact with human bodies.
We propose an accurate and efficient scene text detection framework, termed FAST (i.e., faster arbitrarily-shaped text detector).
Ranked #2 on Scene Text Detection on SCUT-CTW1500
Radio-Frequency (RF) based device-free Human Activity Recognition (HAR) has emerged as a promising solution for many applications.
Given the significant amount of time people spend in vehicles, health issues under driving conditions have become a major concern.
Dropout has been commonly used to quantify prediction uncertainty, i.e., the variations of model predictions on a given input example.
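As a rough illustration of measuring prediction variation with dropout, the sketch below runs repeated stochastic forward passes through a toy linear model and reports the mean and standard deviation of the outputs. The toy model, the function names, and the default parameters are all assumptions made for this example.

```python
import random
import statistics
from typing import Sequence, Tuple


def dropout_forward(
    x: Sequence[float], weights: Sequence[float], drop_prob: float, rng: random.Random
) -> float:
    """One stochastic pass of a toy linear model with dropout on the inputs."""
    keep = 1.0 - drop_prob
    total = 0.0
    for xi, wi in zip(x, weights):
        if rng.random() < keep:
            # Inverted dropout: rescale kept units so the expectation is unchanged.
            total += xi * wi / keep
    return total


def mc_dropout_uncertainty(
    x: Sequence[float],
    weights: Sequence[float],
    drop_prob: float = 0.5,
    passes: int = 200,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean prediction, standard deviation) over repeated dropout passes."""
    if not 0.0 <= drop_prob < 1.0:
        raise ValueError("drop_prob must be in [0, 1)")
    if passes < 1:
        raise ValueError("passes must be at least 1")
    rng = random.Random(seed)
    preds = [dropout_forward(x, weights, drop_prob, rng) for _ in range(passes)]
    return statistics.fmean(preds), statistics.pstdev(preds)
```

The standard deviation serves as the uncertainty estimate here; with `drop_prob=0.0` every pass is identical and the spread collapses to zero.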
In many practical scenarios of signal extraction from a nonlinear mixture, only one (signal) source is intended to be extracted.
To this end, we propose to decompose each video into a series of expression snippets, each of which contains a small number of facial movements, and attempt to augment the Transformer's ability for modeling intra-snippet and inter-snippet visual relations, respectively, obtaining the Expression snippet Transformer (EST).
Different from visible cameras which record intensity images frame by frame, the biologically inspired event camera produces a stream of asynchronous and sparse events with much lower latency.
Ranked #1 on Object Tracking on VisEvent
In this paper, we propose to introduce more dynamics by devising a dynamic attention-guided multi-trajectory tracking strategy.
no code implementations • 30 Mar 2021 • Florian Laurent, Manuel Schneider, Christian Scheller, Jeremy Watson, Jiaoyang Li, Zhe Chen, Yi Zheng, Shao-Hung Chan, Konstantin Makhnev, Oleg Svidchenko, Vladimir Egorov, Dmitry Ivanov, Aleksei Shpilman, Evgenija Spirovska, Oliver Tanevski, Aleksandar Nikov, Ramon Grunder, David Galevski, Jakov Mitrovski, Guillaume Sartoretti, Zhiyao Luo, Mehul Damani, Nilabha Bhattacharya, Shivam Agarwal, Adrian Egli, Erik Nygren, Sharada Mohanty
However, the coordination of hundreds of agents in a real-life setting like a railway network remains challenging and the Flatland environment used for the competition models these real-world properties in a simplified manner.
(1) We divide the input image into small patches and adopt TIN, successfully transferring image style at arbitrarily high resolution.
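The patch division step can be illustrated with a minimal sketch that splits an image into non-overlapping patches and stitches them back together. It covers only the splitting and reassembly; the stylization of each patch is omitted. The nested-list image format and both function names are assumptions for this example.

```python
from typing import List

# Hypothetical image format for this sketch: a 2D list of pixel values.
Image = List[List[int]]


def split_into_patches(image: Image, patch: int) -> List[List[Image]]:
    """Split an image into a grid of non-overlapping patches (edge patches may be smaller)."""
    if patch < 1 or not image or not image[0]:
        raise ValueError("need a non-empty image and a positive patch size")
    height, width = len(image), len(image[0])
    grid: List[List[Image]] = []
    for top in range(0, height, patch):
        row_of_patches: List[Image] = []
        for left in range(0, width, patch):
            row_of_patches.append(
                [line[left:left + patch] for line in image[top:top + patch]]
            )
        grid.append(row_of_patches)
    return grid


def stitch_patches(grid: List[List[Image]]) -> Image:
    """Reassemble a grid of patches into a single image."""
    image: Image = []
    for row_of_patches in grid:
        # All patches in one grid row share the same height.
        for r in range(len(row_of_patches[0])):
            line: List[int] = []
            for p in row_of_patches:
                line.extend(p[r])
            image.append(line)
    return image
```

Processing one patch at a time keeps memory use bounded by the patch size rather than the full image resolution, which is what makes very large inputs tractable.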
Action recognition, which is formulated as a task to identify various human actions in a video, has attracted increasing interest from computer vision researchers due to its importance in various applications.
In this paper, we give a mathematical formalization of the Multi-Agent Path Finding for Car-Like robots (CL-MAPF) problem.
Robotics Multiagent Systems
Despite the impressive prediction performance of deep neural networks (DNNs) in various domains, it is now well known that a set of DNN models trained with the same model specification and the same data can produce very different prediction results.
We introduce a novel neural network-based BRDF model and a Bayesian framework for object inverse rendering, i.e., joint estimation of reflectance and natural illumination from a single image of an object of known geometry.
This method can act as a plug-in for Fast Style Transfer without any modification to the network architecture.
Accurate knowledge of the distribution system topology and parameters is required to achieve good voltage control, but this is difficult to obtain in practice.
We extend the classical result asserting that the twisting operator preserves certain Deligne–Lusztig character values for truncated formal power series; along the way we discuss some properties of centralisers.
This paper proposes a data-driven distributed voltage control approach based on spectrum clustering and an enhanced multi-agent deep reinforcement learning (MADRL) algorithm.
More specifically, we propose to perceive texts from three levels of feature representations, i.e., character-, word- and global-level, and then introduce a novel text representation fusion technique to help achieve robust arbitrary text detection.
Ranked #1 on Scene Text Detection on ICDAR 2015
Human keypoint detection from a single image is very challenging due to occlusion, blur, illumination and scale variance.
Ranked #5 on Pose Estimation on COCO test-dev
Alternatively, to access much more natural-looking pedestrians, we propose to augment pedestrian detection datasets by transforming real pedestrians from the same dataset into different shapes.
Human keypoint detection from a single image is very challenging due to occlusion, blur, illumination and scale variance of person instances.
Learning only one projection matrix from the original samples to the corresponding binary labels is too strict and will consequently lose some intrinsic geometric structures of the data.
To solve the above problems, we propose a low-rank discriminative least squares regression model (LRDLSR) for multi-class image classification.
On the one hand, the Fisher criterion improves the intra-class compactness of the relaxed labels during relaxation learning.
In this paper, we propose a non-negative representation based discriminative dictionary learning algorithm (NRDL) for multicategory face classification.
Current two-stage object detectors, which consist of a region proposal stage and a refinement stage, may produce unreliable results due to ill-localized proposed regions.
We find that further improvements for correlation filter-based tracking can be made by estimating scales, applying part-based tracking strategies, and cooperating with long-term tracking methods.
Variations in the appearance of a tracked object, such as changes in geometry/photometry, camera viewpoint, illumination, or partial occlusion, pose a major challenge to object tracking.
In this paper, a nonparametric maximum likelihood (ML) estimator for band-limited (BL) probability density functions (pdfs) is proposed.
Rodent hippocampal population codes represent important spatial information about the environment during navigation.