In this paper, we focus on teasing out what parts of the language supervision are essential for training zero-shot image classification models.
Traditional computer vision models are trained to predict a fixed set of predefined categories.
To mitigate this problem, we propose a teacher-student framework named Adaptive Teacher (AT) which leverages domain adversarial learning and weak-strong data augmentation to address the domain gap.
To tackle these challenges, we propose FBNetV5, a NAS framework that can search for neural architectures for a variety of vision tasks with much reduced computational cost and human effort.
Ranked #87 on Semantic Segmentation on ADE20K
Neural architecture search (NAS) methods aim to automatically find the optimal deep neural network (DNN) architecture as measured by a given objective function, typically some combination of task accuracy and inference efficiency.
This enables the student model to capture domain-invariant features.
In particular, when transferred to PiT, our scaling strategies boost ImageNet top-1 accuracy from $74.6\%$ to $76.7\%$ ($\uparrow2.1\%$) under the same 0.7G FLOPs; and when transferred to the COCO object detection task, the average precision is boosted by $\uparrow0.7\%$ under a similar throughput on a V100 GPU.
The resulting family of pruned models can consistently obtain better performance than existing FBNetV3 models at the same level of computation, and thus provide state-of-the-art results when trading off between computational complexity and generalization performance on the ImageNet benchmark.
We discover that we can indeed use the same architecture and pretrained weights of a neural net model to understand both images and point-clouds.
Our model transfers knowledge from pretrained image and sentence encoders and achieves strong performance with only 3M image text pairs, 133x smaller than CLIP.
To address this, we propose a novel self-supervised learning task, which we name Trajectory Contrastive Learning (TCL), to improve meta-training.
To address this, we introduce Unbiased Teacher, a simple yet effective approach that jointly trains a student and a gradually progressing teacher in a mutually-beneficial manner.
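One common way to realize a "gradually progressing" teacher is to keep its weights as an exponential moving average (EMA) of the student's weights. The snippet above does not spell out the update rule, so the EMA form, the function name `ema_update`, and the `decay` value below are assumptions made for illustration only. A minimal sketch in plain Python:

```python
def ema_update(teacher, student, decay=0.999):
    """Move each teacher weight a small step toward the matching student weight.

    Assumed update rule (not taken from the text above):
        teacher <- decay * teacher + (1 - decay) * student
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Toy weights; a real model would hold tensors rather than Python floats.
teacher = [0.0, 0.0]
student = [1.0, -1.0]

# A deliberately small decay makes the drift visible within three steps.
for _ in range(3):
    teacher = ema_update(teacher, student, decay=0.5)

print(teacher)
```

With a decay close to 1, the teacher changes slowly and smooths over noisy student updates, which is one plausible reading of "gradually progressing".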
A recent trend in computer vision is to replace convolutions with transformers.
More efficient variants of FBWave can achieve up to 109x fewer MACs while still delivering acceptable audio quality.
Furthermore, to search fast in the multi-variate space, we propose a coarse-to-fine strategy by using a factorized distribution at the beginning which can reduce the number of architecture parameters by over an order of magnitude.
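To see why a factorized distribution can shrink the number of architecture parameters, compare one logit per joint configuration against one logit per option in each dimension. The search dimensions and option counts below are hypothetical and are not taken from the paper; the sketch only illustrates the counting argument.

```python
from math import prod


def joint_param_count(choices_per_dim):
    """A joint categorical distribution needs one logit per full configuration."""
    return prod(choices_per_dim)


def factorized_param_count(choices_per_dim):
    """Independent per-dimension distributions need one logit per option."""
    return sum(choices_per_dim)


# Hypothetical search space with four dimensions (for example kernel size,
# expansion rate, channel width, resolution) and their option counts.
choices = [4, 5, 6, 3]

joint = joint_param_count(choices)
factorized = factorized_param_count(choices)
print(joint, factorized, joint / factorized)
```

For this toy space the joint form needs 360 logits and the factorized form 18, a 20x reduction. The factorized form cannot represent correlations between dimensions, which is consistent with using it only for the coarse stage of the search.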
They require prior knowledge of real-world statistics and ignore the pixel-level dropout noise gap and the spatial feature gap between different domains.
1 code implementation • 1 Sep 2020 • Sicheng Zhao, Xiangyu Yue, Shanghang Zhang, Bo Li, Han Zhao, Bichen Wu, Ravi Krishna, Joseph E. Gonzalez, Alberto L. Sangiovanni-Vincentelli, Sanjit A. Seshia, Kurt Keutzer
To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled or unlabeled target domain.
Deploying deep learning models on embedded systems has been challenging due to limited computing resources.
In this work, we challenge this paradigm by (a) representing images as semantic visual tokens and (b) running transformers to densely model token relationships.
To address this, we present Neural Architecture-Recipe Search (NARS) to search both (a) architectures and (b) their corresponding training recipes, simultaneously.
Ranked #2 on Neural Architecture Search on ImageNet
We propose a masking mechanism for feature map reuse, so that memory and computational costs stay nearly constant as the search space expands.
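The idea of reusing one feature map through masks can be sketched as follows: compute the features once at the largest channel count, then emulate each narrower option by zeroing trailing channels, and blend the options with learned weights. The snippet above gives no further details, so the function names, the plain softmax weighting, and the toy values are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combined_mask(channel_options, logits, max_channels):
    """Weighted sum of binary masks, one mask per candidate channel count.

    The mask for option k is 1 on the first k channels and 0 elsewhere.
    """
    weights = softmax(logits)
    mask = [0.0] * max_channels
    for k, w in zip(channel_options, weights):
        for c in range(k):
            mask[c] += w
    return mask


def masked_features(features, mask):
    """Apply the combined mask to per-channel features computed once."""
    return [f * m for f, m in zip(features, mask)]


# Two hypothetical width options (2 or 4 channels) with equal logits.
mask = combined_mask([2, 4], [0.0, 0.0], max_channels=4)
out = masked_features([2.0, 2.0, 2.0, 2.0], mask)
print(mask, out)
```

Adding another width option adds one logit and one cheap mask, but no extra convolution or stored feature map, which is why cost can stay nearly flat as the search space grows.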
Ranked #44 on Neural Architecture Search on ImageNet
Using standard convolutions to process such LiDAR images is problematic, as convolution filters pick up local features that are only active in specific regions in the image.
Ranked #12 on 3D Semantic Segmentation on SemanticKITTI
In this work, we first investigate the overhead of the deformable convolution on embedded FPGA SoCs, and then show the accuracy-latency tradeoffs for a set of algorithm modifications including full versus depthwise, fixed-shape, and limited-range.
Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech.
Sound Audio and Speech Processing
2) One-click annotation: Instead of drawing 3D bounding boxes or point-wise labels, we simplify the annotation to just one click on the target object, and automatically generate the bounding box for the target.
We formulate platform-aware NN architecture search in an optimization framework and propose a novel algorithm to search for optimal architectures aided by efficient accuracy and resource (latency and/or energy) predictors.
Due to this, previous neural architecture search (NAS) methods are computationally expensive.
Ranked #538 on Image Classification on ImageNet
Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources.
DiracDeltaNet achieves competitive accuracy on ImageNet (88.7\% top-5), but with 42$\times$ fewer parameters and 48$\times$ fewer OPs than VGG16.
When training our new model on synthetic data using the proposed domain adaptation pipeline, we nearly double test accuracy on real-world data, from 29.0% to 57.4%.
Ranked #21 on 3D Semantic Segmentation on SemanticKITTI
The framework supports data collection from both auto-driving scenes and user-configured scenes.
To cope with limited labeled training data, many have attempted to directly apply models trained on a large-scale labeled source domain to another sparsely labeled target domain.
One of the main barriers to deploying neural networks on embedded systems has been the large memory and power consumption of existing neural networks.
Neural networks rely on convolutions to aggregate spatial information.
In this paper, we address semantic segmentation of road-objects from 3D LiDAR point clouds.
Ranked #23 on 3D Semantic Segmentation on SemanticKITTI
In addition to requiring high accuracy to ensure safety, object detection for autonomous driving also requires real-time inference speed to guarantee prompt vehicle control, as well as small model size and energy efficiency to enable embedded system deployment.
The ability to automatically detect other vehicles on the road is vital to the safety of partially-autonomous and fully-autonomous vehicles.