Cross-modality interaction is a critical component in Text-Video Retrieval (TVR), yet there has been little examination of how different influencing factors for computing interaction affect performance.
Ranked #3 on Video Retrieval on MSR-VTT-1kA (using extra training data)
Recent works have shown that convolutional networks have substantially improved the performance of multiple object tracking by simultaneously learning detection and appearance features.
First, we adopt a simple but effective decoupled learning strategy of representations and classifiers that only the classifiers are updated in each incremental session, which avoids knowledge forgetting in the representations.
Nowadays, live-stream and short video shopping in E-commerce have grown exponentially.
In this paper, we present a novel side information based large scale visual recognition co-training~(SICoT) system to deal with the long tail problem by leveraging the image related side information.
For a deployed visual search system with several billions of online images in total, building a billion-scale offline graph in hours is essential, which is almost unachievable by most existing methods.
Benefiting from exploration of user click data, our networks are more effective to encode richer supervision and better distinguish real-shot images in terms of category and feature.
In many real-world datasets, like WebVision, the performance of DNN based classifier is often limited by the noisy labeled data.
The LaBr$_3$(Ce) detector has attracted much attention in recent years for its superior characteristics to other scintillating materials in terms of resolution and efficiency.
Instrumentation and Detectors Nuclear Experiment