Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.
We empirically show that our sparse Mobile Vision MoEs (V-MoEs) can achieve a better trade-off between performance and efficiency than the corresponding dense ViTs.
Further experiments on zero-shot and linear probe image classification also show that MOFI outperforms a CLIP model trained on the original image-text data, demonstrating the effectiveness of the I2E dataset in learning strong image representations.
In this paper, we discuss two effective approaches to improve the efficiency and robustness of CLIP training: (1) augmenting the training dataset while maintaining the same number of optimization steps, and (2) filtering out samples that contain text regions in the image.
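As a concrete illustration of approach (2), a minimal filtering sketch is given below; `detect_text_boxes` is a stand-in for any off-the-shelf scene-text detector, and the 5% area threshold is an illustrative assumption, not the paper's criterion.

```python
import numpy as np

# Sketch of filtering out image-text pairs whose images contain large
# text regions. `detect_text_boxes` is a hypothetical callable (any
# scene-text detector returning pixel boxes); the threshold is illustrative.
def filter_text_images(samples, detect_text_boxes, max_text_frac=0.05):
    kept = []
    for image, caption in samples:  # image: HxWxC array, caption: str
        h, w = image.shape[:2]
        boxes = detect_text_boxes(image)  # list of (x0, y0, x1, y1) boxes
        text_area = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in boxes)
        if text_area / (h * w) <= max_text_frac:
            kept.append((image, caption))
    return kept

# Usage with a trivial detector that finds no text (keeps everything):
samples = [(np.zeros((224, 224, 3)), "a photo of a dog")]
print(len(filter_text_images(samples, lambda img: [])))
```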
Instead of compressing the knowledge of multiple tasks into a single model, MoE partitions the parameter space and activates only the relevant model pieces for a given task type and input, which provides stable multi-task training and highly efficient inference.
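A minimal sketch of what such task-conditioned sparse routing could look like (PyTorch; the expert count, top-k routing rule, and module names are illustrative assumptions, not the paper's design):

```python
import torch
import torch.nn as nn

class TaskRoutedMoE(nn.Module):
    """Sparse MoE layer where the task id picks which experts run."""
    def __init__(self, dim, num_tasks=16, num_experts=8, experts_per_task=2):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )
        # One learned routing row per task: experts are selected by task
        # type, so unrelated tasks do not share (or interfere with) parameters.
        self.task_router = nn.Embedding(num_tasks, num_experts)
        self.k = experts_per_task

    def forward(self, x, task_id):
        logits = self.task_router.weight[task_id]
        weights, idx = torch.topk(logits.softmax(dim=-1), self.k)
        weights = weights / weights.sum()  # renormalize over selected experts
        # Only k of the experts execute, keeping inference cost low.
        return sum(w * self.experts[i](x) for w, i in zip(weights, idx.tolist()))

moe = TaskRoutedMoE(dim=64)
out = moe(torch.randn(2, 64), task_id=3)  # (2, 64)
```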
This paper summarizes model improvements and inference-time optimizations for popular anchor-based detectors in autonomous-driving scenes.
The motivation comes from two pain points: 1) the lack of efficient and principled methods for designing and scaling ViTs; and 2) the tremendous computational cost of training ViTs, which is much higher than that of their convolutional counterparts.
In this paper, we study contrastive learning from an optimization perspective, aiming to analyze and address a fundamental issue of existing contrastive learning methods that rely on either a large batch size or a large dictionary of feature vectors.
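To make the issue concrete, below is a minimal sketch of the standard InfoNCE objective (the common formulation being analyzed, not the paper's proposed method): every other sample in the batch serves as a negative, so the quality of the objective hinges on a large batch or, equivalently, a large dictionary of features.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, tau=0.1):
    """Standard InfoNCE loss over a batch of paired embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / tau          # (B, B) pairwise similarities
    labels = torch.arange(z1.size(0))   # positives lie on the diagonal
    # The remaining B-1 entries per row act as negatives, which is why a
    # small batch gives a poor approximation of the full objective.
    return F.cross_entropy(logits, labels)

loss = info_nce(torch.randn(32, 128), torch.randn(32, 128))
```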
In this paper, we comprehensively study three architectural design choices for ViT -- spatial reduction, doubled channels, and multiscale features -- and demonstrate that a vanilla ViT architecture can fulfill this goal without handcrafting multiscale features, maintaining the original ViT design philosophy.
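A minimal sketch of how multiscale features can be obtained from a single-scale, vanilla-ViT feature map without modifying the backbone (module name and layer choices are illustrative assumptions, not the paper's exact design):

```python
import torch
import torch.nn as nn

class SimplePyramid(nn.Module):
    """Builds a small pyramid from one stride-16 ViT feature map."""
    def __init__(self, dim):
        super().__init__()
        self.to_stride4 = nn.Sequential(   # upsample 4x
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
            nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2),
        )
        self.to_stride8 = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)
        self.to_stride32 = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, feat16):
        # The backbone stays single-scale; multiscale maps are derived here.
        return [self.to_stride4(feat16), self.to_stride8(feat16),
                feat16, self.to_stride32(feat16)]

pyramid = SimplePyramid(dim=256)
levels = pyramid(torch.randn(1, 256, 32, 32))  # strides 4, 8, 16, 32
```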
The experiments show that the resultant unified foundation transformer works surprisingly well on both vision-only and text-only tasks, and that the proposed knowledge distillation and gradient masking strategy can effectively lift performance to approach the level of separately trained models.
Recent work by Bello shows that training and scaling strategies may matter more than model architectures for visual recognition.
We benchmark these improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors.
Scale-permuted networks have shown promising results on object bounding box detection and instance segmentation.
Using improved training and scaling strategies, we design a family of ResNet architectures, ResNet-RS, which are 1.7x-2.7x faster than EfficientNets on TPUs while achieving similar accuracies on ImageNet.
Furthermore, SpineNet is built with a uniform resource distribution over operations.
We propose SpineNet, a backbone with scale-permuted intermediate features and cross-scale connections that is learned on an object detection task by Neural Architecture Search.
In this paper, we introduce the problem of estimating the real-world depth of elements in a scene captured by two cameras with different fields of view: the first field of view (FOV) is a wide FOV (WFOV) captured by a wide-angle lens, and the second FOV is contained within the first and is captured by a tele zoom lens.
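For orientation, depth recovery in such a two-camera setup ultimately rests on the standard rectified-stereo relation, sketched below with generic pinhole symbols (f, B, and d are not the paper's notation):

```latex
% Standard rectified-stereo depth relation: Z is depth, f the focal
% length, B the baseline between the two cameras, and d the disparity
% measured after aligning the tele view to the wide view's scale.
Z = \frac{f\,B}{d}
```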
A stacked atrous multiscale network is proposed to aggregate rich multiscale contextual information from the cost volume, which allows the disparity to be estimated with high accuracy at multiple scales.
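A minimal sketch of the general idea, stacking atrous (dilated) convolutions over a cost-volume-like tensor (the layer layout and dilation rates are illustrative assumptions, not the paper's network):

```python
import torch
import torch.nn as nn

class AtrousMultiscaleBlock(nn.Module):
    """Parallel dilated 3x3 branches aggregate context at several scales."""
    def __init__(self, channels, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        self.merge = nn.Conv2d(channels * len(dilations), channels, kernel_size=1)

    def forward(self, cost):
        # Each branch sees a different receptive field at full resolution;
        # concatenation plus a 1x1 merge aggregates the multiscale context.
        return self.merge(torch.cat([b(cost) for b in self.branches], dim=1))

block = AtrousMultiscaleBlock(channels=32)
out = block(torch.randn(1, 32, 64, 128))  # same spatial size as the input
```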
The classification system further classifies the generated candidates based on the opinions of multiple deep verification networks and a fusion network, which uses a novel soft-rejection fusion method to adjust the confidence of the detection results.
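One plausible form of such a soft-rejection rule is sketched below; the scaling expression and constants are illustrative assumptions, not the paper's exact formulation.

```python
# Soft-rejection fusion sketch: each verification network rescales the
# detector's confidence instead of issuing a hard veto.
def soft_rejection_fuse(det_score, verifier_scores, t=0.7, floor=0.1):
    fused = det_score
    for s in verifier_scores:
        # A confident verifier (s > t) boosts the score; a doubtful one
        # attenuates it, but never below `floor`, so no single network
        # can reject a candidate outright.
        fused *= max(s / t, floor)
    return fused

print(soft_rejection_fuse(0.9, [0.95, 0.40, 0.80]))
```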
Compared to the general semantic segmentation problem, portrait segmentation imposes a higher precision requirement in boundary areas.
A single-shot deep convolutional network is trained as an object detector to generate all possible pedestrian candidates across different sizes and occlusion levels.