Most graph-to-text works are built on the encoder-decoder framework with a cross-attention mechanism.
Texts in scene images convey critical information for scene understanding and reasoning.
However, it is hard for a vanilla encoder to capture this textual information.
Ranked #1 on Table-to-Text Generation on RotoWire
Second, the target texts in the training dataset may contain redundant information or facts that do not exist in the input tables.
Modern object detection methods based on convolutional neural networks suffer from severe catastrophic forgetting when learning new classes without access to the original training data.
As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances flexibility and reduces the complexity of representation learning.
Ranked #8 on Self-supervised Video Retrieval on HMDB51