We present VideoCLIP, a contrastive approach to pre-train a unified model for zero-shot video and text understanding, without using any labels on downstream tasks.
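As a rough illustration of what a contrastive video-text objective can look like, the sketch below computes a generic symmetric InfoNCE-style loss over paired embeddings. It is not taken from VideoCLIP: the function names, the temperature of 0.07, and the toy 2-D embeddings are all assumptions made for illustration.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    """Mean of -log softmax(row)[i], treating item i as the positive for row i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Hypothetical helper: average of video-to-text and text-to-video losses.

    video_emb[i] and text_emb[i] are assumed to be a matching pair; every
    other combination in the batch acts as a negative.
    """
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))


# Toy batch: correctly paired embeddings vs. the same embeddings mispaired.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is small when each video embedding is closest to its own text embedding and large when the pairs are swapped, which is the behavior a contrastive pre-training objective rewards.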
An easy-to-use and powerful NLP library with an extensive model zoo, supporting a wide range of NLP tasks from research to industrial applications, including end-to-end systems for neural search, question answering, information extraction, and sentiment analysis.
Our key insight is to take advantage of the powerful vision-language model CLIP for supervising neural human generation, in terms of 3D geometry, texture and animation.
The Transformer architecture has improved the performance of deep learning models in domains such as Computer Vision and Natural Language Processing.
When fine-tuning on downstream tasks, a modality-specific adapter is used to introduce the data and tasks' prior information into the model, making it suitable for these tasks.
Ranked #1 on Semantic Segmentation on ADE20K val
Optical flow, which captures motion information across frames, is exploited in recent video inpainting methods by propagating pixels along its trajectories.
Ranked #1 on Video Inpainting on YouTube-VOS 2018 val
Text-to-image generation has traditionally focused on finding better modeling assumptions for training on a fixed dataset.
Ranked #5 on Zero-Shot Text-to-Image Generation on COCO
The confusion matrix, a ubiquitous visualization for helping people evaluate machine learning models, is a tabular layout that compares predicted class labels against actual class labels over all data instances.
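The tabular layout described above can be sketched in a few lines. This is a minimal illustration, not code from the paper: the `confusion_matrix` helper, the row and column convention (rows are actual classes, columns are predicted classes), and the toy labels are assumptions.

```python
from collections import Counter


def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows = actual class, columns = predicted class."""
    counts = Counter(zip(actual, predicted))
    return [[counts[(a, p)] for p in labels] for a in labels]


# Toy data: six instances over three classes.
actual = ["cat", "cat", "dog", "dog", "dog", "bird"]
predicted = ["cat", "dog", "dog", "dog", "cat", "bird"]
labels = ["bird", "cat", "dog"]

matrix = confusion_matrix(actual, predicted, labels)
for label, row in zip(labels, matrix):
    print(f"{label:>5}: {row}")

# Correct predictions lie on the diagonal; accuracy is their share of all instances.
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / len(actual)
```

Off-diagonal cells show which classes are confused with each other; here one cat is predicted as a dog and one dog as a cat.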
Despite its simplicity, benchmark results show our system's note estimation to be substantially better than a comparable baseline, and its frame-level accuracy to be only marginally below that of specialized state-of-the-art AMT systems.
PennyLane is a Python 3 software framework for optimization and machine learning of quantum and hybrid quantum-classical computations.