Deflation is a video-to-image operation that transforms a video network into a network that can ingest a single image. For the two types of video networks considered in the original paper, deflation corresponds to the following operations: for 3D-convolution-based networks, summing the 3D spatio-temporal filters over the temporal dimension to obtain 2D filters; for TSM networks, turning off the channel shifting, which results in a standard residual architecture (ResNet50) for images.
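The 3D-convolution case can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names (`deflate_conv3d_kernel`, `conv2d_valid`, `conv3d_valid`) and the channels-last tensor layout are assumptions made for illustration, and the convolutions are naive, unpadded ("valid") loops. Under those assumptions, applying the summed 2D filter to an image gives the same result as applying the original 3D filter to a static video built by repeating that image along the time axis.

```python
import numpy as np


def deflate_conv3d_kernel(kernel_3d, temporal_axis=0):
    """Sum a 3D spatio-temporal kernel over its temporal axis to get a 2D kernel.

    Assumed layout: (kt, kh, kw, c_in, c_out) -> (kh, kw, c_in, c_out).
    """
    return kernel_3d.sum(axis=temporal_axis)


def conv2d_valid(image, kernel_2d):
    """Naive unpadded 2D cross-correlation.

    image: (H, W, c_in); kernel_2d: (kh, kw, c_in, c_out).
    """
    kh, kw, _, c_out = kernel_2d.shape
    h, w, _ = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1, c_out))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw, :]
            # Contract the patch against the kernel over (kh, kw, c_in).
            out[i, j] = np.tensordot(patch, kernel_2d, axes=([0, 1, 2], [0, 1, 2]))
    return out


def conv3d_valid(video, kernel_3d):
    """Naive unpadded 3D cross-correlation.

    video: (T, H, W, c_in); kernel_3d: (kt, kh, kw, c_in, c_out).
    """
    kt, kh, kw, _, c_out = kernel_3d.shape
    t, h, w, _ = video.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1, c_out))
    for s in range(out.shape[0]):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                patch = video[s:s + kt, i:i + kh, j:j + kw, :]
                out[s, i, j] = np.tensordot(
                    patch, kernel_3d, axes=([0, 1, 2, 3], [0, 1, 2, 3])
                )
    return out


# Illustrative shapes only; none of these values come from the paper.
rng = np.random.default_rng(0)
image = rng.normal(size=(6, 6, 3))
kernel_3d = rng.normal(size=(3, 3, 3, 3, 4))

# A static video: the same image repeated once per temporal tap of the kernel.
static_video = np.repeat(image[None], kernel_3d.shape[0], axis=0)

kernel_2d = deflate_conv3d_kernel(kernel_3d)
out_image = conv2d_valid(image, kernel_2d)
out_video = conv3d_valid(static_video, kernel_3d)

# The deflated 2D filter reproduces the 3D filter's response on the static video.
outputs_match = np.allclose(out_video[0], out_image)
```

The equivalence follows from linearity: when every frame is identical, the temporal sum inside the 3D convolution can be moved onto the filter weights. With temporal padding, or with layers such as normalization that depend on activation statistics, the match may no longer be exact.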
Source: Self-Supervised MultiModal Versatile Networks
Task | Papers | Share |
---|---|---|
Blind Source Separation | 1 | 25.00% |
Action Recognition In Videos | 1 | 25.00% |
Audio Classification | 1 | 25.00% |
Self-Supervised Action Recognition | 1 | 25.00% |