However, a notable limitation of these approaches is that they do not fully utilize the multi-view attributes inherent in 3D point clouds, which is crucial for a deeper understanding of 3D structures.
The MoDAR modality propagates object information from temporal contexts to a target frame, represented as a set of virtual points, one for each object from a waypoint on a forecasted trajectory.
Foundation models have achieved remarkable results in 2D and language tasks like image segmentation, object detection, and visual-language understanding.
Through systematic analysis, we found that the commonly used pairwise affinity loss has two limitations: (1) it works with color affinity but leads to inferior performance with other modalities such as depth gradient, (2)the original affinity loss does not prevent trivial predictions as intended but actually accelerates this process due to the affinity loss term being symmetric.
We construct our stereo depth estimation model, Context Enhanced Stereo Transformer (CSTR), by plugging CEP into the state-of-the-art stereo depth estimation method Stereo Transformer.
An edge in the graph encodes the relative distance information between a pair of target and reference objects.
no code implementations • 8 Jun 2022 • Longlong Jing, Ruichi Yu, Henrik Kretzschmar, Kang Li, Charles R. Qi, Hang Zhao, Alper Ayvaci, Xu Chen, Dillon Cower, Yingwei Li, Yurong You, Han Deng, CongCong Li, Dragomir Anguelov
Monocular image-based 3D perception has become an active research area in recent years owing to its applications in autonomous driving.
Specifically, our modifications in Fast AdvProp are guided by the hypothesis that disentangled learning with adversarial examples is the key for performance improvements, while other training recipes (e. g., paired clean and adversarial training samples, multi-step adversarial attackers) could be largely simplified.
By swapping the context object features, the model reliance on context can be suppressed effectively.
In this paper, we propose two novel techniques: InverseAug that inverses geometric-related augmentations, e. g., rotation, to enable accurate geometric alignment between lidar points and image pixels, and LearnableAlign that leverages cross-attention to dynamically capture the correlations between image and lidar features during fusion.
Our method achieves the state-of-the-art performance on three video action recognition benchmarks (i. e., Kinetics-400, UCF-101, and HMDB-51) under several typical semi-supervised settings (i. e., different ratios of labeled data).
In order to effectively search in this huge architecture space, we propose Hierarchical Sampling for better training of the supernet.
Crowd counting, which has been widely adopted for estimating the number of people in safety-critical scenes, is shown to be vulnerable to adversarial examples in the physical world (e. g., adversarial patches).
Although deep neural networks have been a dominant method for many 2D vision tasks, it is still challenging to apply them to 3D tasks, such as medical image segmentation, due to the limited amount of annotated 3D data and limited computational resources.
To prevent models from exclusively attending on a single cue in representation learning, we augment training data with images with conflicting shape and texture information (eg, an image of chimpanzee shape but with lemon texture) and, most importantly, provide the corresponding supervisions from shape and texture simultaneously.
Ranked #578 on Image Classification on ImageNet
However, it has been rarely explored to embed the NL blocks in mobile neural networks, mainly due to the following challenges: 1) NL blocks generally have heavy computation cost which makes it difficult to be applied in applications where computational resources are limited, and 2) it is an open problem to discover an optimal configuration to embed NL blocks into mobile neural networks.
Ranked #60 on Neural Architecture Search on ImageNet
3D Convolution Neural Networks (CNNs) have been widely applied to 3D scene understanding, such as video analysis and volumetric image recognition.
In order to understand the impact of adversarial attacks on depth estimation, we first define a taxonomy of different attack scenarios for depth estimation, including non-targeted attacks, targeted attacks and universal attacks.
We propose a fine-grained search space comprised of atomic blocks, a minimal search unit that is much smaller than the ones used in recent NAS algorithms.
Ranked #61 on Neural Architecture Search on ImageNet
Pancreatic ductal adenocarcinoma (PDAC) is one of the most lethal cancers with an overall five-year survival rate of 8%.
Trauma is the worldwide leading cause of death and disability in those younger than 45 years, and pelvic fractures are a major source of morbidity and mortality.
We observe the property of regional homogeneity in adversarial perturbations and suggest that the defenses are less robust to regionally homogeneous perturbations.
However, our work observes the extreme vulnerability of existing distance metrics to adversarial examples, generated by simply adding human-imperceptible perturbations to person images.
The critical principle of ghost networks is to apply feature-level perturbations to an existing model to potentially create a huge set of diverse models.
To account for long-range inhomogeneous dynamics, a VLAD descriptor is derived for the LDS and pooled over the whole video, to arrive at the final VLAD^3 representation.