However, under their blind-spot constraints, previous self-supervised video denoising methods suffer from significant information loss and texture destruction in either the whole reference frame or the neighboring frames, because they do not adequately account for the receptive field.
For video content creation and understanding, shot boundary detection (SBD) is one of the most essential components in various scenarios.
Ranked #1 on Camera Shot Boundary Detection on ClipShots
Most of these methods fail to achieve realistic reconstruction when only a single image is available.
Recent video text spotting methods usually require a three-stage pipeline, i.e., detecting text in individual frames, recognizing the localized text, and tracking text streams with post-processing to generate the final results.
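As a hedged illustration of the three-stage pipeline described above (the function and class names here are hypothetical, not from any cited method), the overall flow might be sketched as:

```python
from dataclasses import dataclass

@dataclass
class TextInstance:
    box: tuple          # bounding box of the text in one frame
    text: str           # recognized string for that box
    track_id: int = -1  # identity assigned during tracking (stage 3)

def spot_text_in_video(frames, detector, recognizer, tracker):
    """Sketch of a three-stage pipeline: detect -> recognize -> track.

    `detector`, `recognizer`, and `tracker` stand in for whatever
    models a concrete method uses; only the staging is illustrated.
    """
    per_frame = []
    for frame in frames:
        boxes = detector(frame)                                   # stage 1: detection
        instances = [TextInstance(box, recognizer(frame, box))    # stage 2: recognition
                     for box in boxes]
        per_frame.append(instances)
    # stage 3: post-processing that links instances across frames
    return tracker(per_frame)
```

The separation into three independent stages is exactly what makes such pipelines slow and error-prone: mistakes in detection propagate to recognition and tracking with no way to recover.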
Most existing video text spotting benchmarks focus on evaluating a single language and scenario with limited data.
In this paper, we present TransAug (Translate as Augmentation), which provides the first exploration of utilizing translated sentence pairs as data augmentation for text, and introduces a two-stage paradigm that advances state-of-the-art sentence embeddings.
Additional rich unpaired single-modal text data is used to boost the generalization of the text branch.
Experiments demonstrate no loss of accuracy when training with only 10\% of randomly sampled classes for softmax-based loss functions, compared with training on the full set of classes using state-of-the-art models on mainstream benchmarks.
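A minimal sketch of the class-sampling idea behind such a result (the exact selection scheme here, uniform sampling of negative classes plus the true classes of the batch, is an assumption for illustration, not the cited method):

```python
import numpy as np

def sampled_softmax_loss(features, weights, labels, sample_ratio=0.1, rng=None):
    """Softmax cross-entropy computed over a random subset of classes.

    The true classes of the batch are always kept; a random fraction of
    the remaining classes serve as negatives, so the logit matrix is
    (batch, ~sample_ratio * num_classes) instead of (batch, num_classes).
    """
    rng = rng or np.random.default_rng()
    num_classes = weights.shape[0]
    positives = np.unique(labels)
    negatives = np.setdiff1d(np.arange(num_classes), positives)
    n_sampled = max(1, int(sample_ratio * num_classes) - len(positives))
    sampled = rng.choice(negatives, size=min(n_sampled, len(negatives)),
                         replace=False)
    classes = np.concatenate([positives, sampled])   # active class subset
    remap = {c: i for i, c in enumerate(classes)}    # global id -> local id
    logits = features @ weights[classes].T           # (batch, |subset|)
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    local_labels = np.array([remap[l] for l in labels])
    return -log_probs[np.arange(len(labels)), local_labels].mean()
```

Because the classifier weight matrix is only ever indexed by the sampled subset, both the matrix multiply and the normalization cost shrink proportionally to the sampling ratio.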
Ranked #2 on Face Identification on MegaFace
8-bit quantization has been widely applied to accelerate network inference in various deep learning applications.
Hyperspectral image (HSI) recovery from a single RGB image has attracted much attention; its performance has recently been shown to be sensitive to the camera spectral sensitivity (CSS).