Natural language generation technology has recently seen remarkable progress with large-scale training, and many natural language applications are now built upon a wide range of generation models.
Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
Ranked #1 on Action Classification on Kinetics-600 (using extra training data)
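Below is a minimal, hypothetical sketch of the masked-snippet selection objective described above: one snippet's embedding is replaced with a learned MASK vector, and the model is trained contrastively to pick the true snippet out of the batch. The module names, dimensions, and the use of a Transformer encoder are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch (not the authors' code): mask one snippet per video and train the
# model to select the correct masked-out snippet from the batch candidates.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedSnippetSelector(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.mask_embedding = nn.Parameter(torch.randn(dim))  # stands in for the masked snippet
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )

    def forward(self, snippets, mask_index):
        # snippets: (batch, num_snippets, dim) joint video/text/audio snippet embeddings
        x = snippets.clone()
        x[:, mask_index] = self.mask_embedding      # replace one snippet with MASK
        contextual = self.encoder(x)                # contextualize over the whole video
        return contextual[:, mask_index]            # prediction for the masked slot

def snippet_selection_loss(model, snippets, mask_index):
    # Candidates are the true masked snippets from every example in the batch;
    # each prediction should score its own snippet highest.
    pred = model(snippets, mask_index)              # (batch, dim)
    candidates = snippets[:, mask_index]            # (batch, dim)
    logits = pred @ candidates.t()                  # similarity to all candidates
    targets = torch.arange(len(snippets))
    return F.cross_entropy(logits, targets)

model = MaskedSnippetSelector()
loss = snippet_selection_loss(model, torch.randn(8, 6, 256), mask_index=3)
```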
no code implementations • 16 Dec 2021 • Ximing Lu, Sean Welleck, Peter West, Liwei Jiang, Jungo Kasai, Daniel Khashabi, Ronan Le Bras, Lianhui Qin, Youngjae Yu, Rowan Zellers, Noah A. Smith, Yejin Choi
To enable constrained generation, we build on NeuroLogic decoding (Lu et al., 2021), combining its flexibility in incorporating logical constraints with A*esque estimates of future constraint satisfaction.
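The combination of logical constraints with A*-like lookahead can be illustrated with a toy decoding loop: each beam candidate is scored by its log-probability so far plus an estimate, obtained from a short greedy rollout, of how well the remaining lexical constraints can still be satisfied. The toy language model, vocabulary, and weighting below are placeholder assumptions for illustration and do not reproduce the authors' algorithm.

```python
# Hedged sketch of lookahead-scored constrained beam search.
import math

VOCAB = ["the", "dog", "ran", "park", "to", "<eos>"]

def toy_lm(prefix):
    """Stand-in language model: uniform next-token distribution (prefix unused)."""
    return {tok: 1.0 / len(VOCAB) for tok in VOCAB}

def lookahead_bonus(prefix, constraints, depth=3):
    """Greedy rollout; reward prefixes from which unmet constraints are still reachable."""
    rollout, unmet = list(prefix), set(constraints) - set(prefix)
    bonus = 0.0
    for _ in range(depth):
        probs = toy_lm(rollout)
        # prefer constraint tokens during the rollout
        tok = max(probs, key=lambda t: probs[t] + (1.0 if t in unmet else 0.0))
        rollout.append(tok)
        if tok in unmet:
            unmet.discard(tok)
            bonus += math.log(probs[tok] + 1e-9)
    return bonus - 5.0 * len(unmet)                 # penalize constraints that stay unreachable

def decode(constraints, beam_size=3, max_len=8, alpha=0.5):
    beams = [([], 0.0)]
    for _ in range(max_len):
        expanded = []
        for prefix, logp in beams:
            for tok, p in toy_lm(prefix).items():
                cand = prefix + [tok]
                # past log-probability plus estimated future constraint satisfaction
                score = logp + math.log(p) + alpha * lookahead_bonus(cand, constraints)
                expanded.append((cand, logp + math.log(p), score))
        expanded.sort(key=lambda x: x[2], reverse=True)
        beams = [(c, lp) for c, lp, _ in expanded[:beam_size]]
    return beams[0][0]

print(decode(constraints={"dog", "park"}))
```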
In a difficult zero-shot setting with no paired audio-text data, our model achieves state-of-the-art performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1.
It remains an open question whether incorporating external knowledge benefits commonsense reasoning while maintaining the flexibility of pretrained sequence models.
We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future.
Despite recent advances in natural language generation, it remains challenging to control attributes of generated text.
Despite considerable advances in deep neural language models (LMs), neural text generation still suffers from degeneration: the generated text is repetitive, generic, self-contradictory, and often lacks commonsense.
Our results show that commonsense knowledge models can rapidly adapt from limited examples, indicating that KG fine-tuning serves to learn an interface to encoded knowledge learned during pretraining.
While the dominant recipe for conditional text generation has been large-scale pretrained language models fine-tuned on task-specific data, such models do not reliably learn to follow the underlying constraints, even when supervised with large amounts of task-specific examples.
In this paper, we present Reflective Decoding, a novel unsupervised algorithm that allows for direct application of unidirectional LMs to non-sequential tasks.
HATNet extends the bag-of-words approach and uses self-attention to encode global information, allowing it to learn representations from clinically relevant tissue structures without any explicit supervision.
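As a rough illustration of the bag-of-words-with-self-attention idea, the sketch below splits a tissue image into patches, embeds each patch with a small CNN, and lets a self-attention layer mix information across the whole bag before classification. The patch size, embedding dimension, and pooling/classification head are assumptions, not HATNet's actual architecture.

```python
# Hedged sketch: bag-of-patches encoding with self-attention for global context.
import torch
import torch.nn as nn

class BagOfPatchesClassifier(nn.Module):
    def __init__(self, patch=64, dim=128, num_classes=2):
        super().__init__()
        self.patch, self.dim = patch, dim
        self.embed = nn.Sequential(                 # per-patch "word" embedding
            nn.Conv2d(3, dim, kernel_size=7, stride=2, padding=3),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.attend = nn.TransformerEncoderLayer(   # self-attention over the bag
            d_model=dim, nhead=4, batch_first=True
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, image):
        # image: (batch, 3, H, W) with H, W divisible by the patch size
        b, c, h, w = image.shape
        patches = image.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(-1, c, self.patch, self.patch)
        tokens = self.embed(patches).view(b, -1, self.dim)
        global_tokens = self.attend(tokens)         # each patch attends to the whole bag
        return self.head(global_tokens.mean(dim=1))

model = BagOfPatchesClassifier()
logits = model(torch.randn(2, 3, 256, 256))         # -> (2, 2)
```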