Language technologies that accurately model the dynamics of events must perform commonsense reasoning.
Results on five difficult question-answering datasets (StrategyQA, QuaRel, OpenBookQA, NumerSense, and QASC) show that MaRio not only improves task accuracy but also improves the self-rationalization quality of small LMs across the aforementioned axes better than a supervised fine-tuning (SFT) baseline.
In this work, we study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem.
NORMLENS consists of 10K human judgments accompanied by free-form explanations covering 2K multimodal situations, and serves as a probe to address two questions: (1) to what extent can models align with average human judgment?
These descriptions enable (1) collecting human-verified reference outputs for each instance and (2) automatic evaluation of candidate multimodal generations with a text-only LLM, in a way that aligns with human judgment.
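A minimal sketch of this text-only evaluation step, assuming the instance descriptions and references are available as strings; `score_with_llm` and the `call_llm` callable are hypothetical stand-ins, not the paper's actual pipeline:

```python
# Hypothetical sketch: judging a multimodal generation with a text-only LLM.
# `call_llm` is an assumed stand-in for any chat-completion API.

def score_with_llm(call_llm, description: str, reference: str, candidate: str) -> int:
    """Ask a text-only LLM to rate a candidate against a human-verified reference."""
    prompt = (
        "An image is described as follows:\n"
        f"{description}\n\n"
        f"Reference output: {reference}\n"
        f"Candidate output: {candidate}\n\n"
        "On a scale of 1-5, how well does the candidate match the reference "
        "given the described image? Reply with a single integer."
    )
    reply = call_llm(prompt)
    return int(reply.strip())
```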
We introduce OpenFlamingo, a family of autoregressive vision-language models ranging from 3B to 9B parameters.
Surprising videos, e.g., funny clips, creative performances, or visual illusions, attract significant attention.
Our evaluations show that the best model in any given evaluation reaches on average 87% of ChatGPT performance, and 73% of GPT-4 performance, suggesting that further investment in building better base models and instruction-tuning data is required to close the gap.
We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., from a single object, to object+property, to multiple interacting objects).
We release Multimodal C4, an augmentation of the popular text-only C4 corpus with images interleaved.
We introduce WHOOPS!, a new dataset and benchmark for visual commonsense.
Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs, ethical norms), and larger models like GPT-3 manifest broad commonsense reasoning capacity.
Data scarcity has been a long-standing issue in the field of open-domain social dialogue.
To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL.
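As a schematic of what optimizing a language generator with RL involves (a minimal sketch, not RL4LMs' actual API; `reward_fn` and the Hugging Face-style model/tokenizer are assumptions):

```python
import torch

def reinforce_step(model, tokenizer, optimizer, prompt: str, reward_fn) -> float:
    """One REINFORCE-style update: sample a continuation, score it, reinforce it.
    Schematic only: real setups add a baseline and a KL penalty to a reference model."""
    inputs = tokenizer(prompt, return_tensors="pt")
    prompt_len = inputs["input_ids"].shape[1]
    # Sample a continuation from the current policy (the language model).
    sample = model.generate(**inputs, do_sample=True, max_new_tokens=32)
    text = tokenizer.decode(sample[0, prompt_len:], skip_special_tokens=True)
    reward = reward_fn(text)  # scalar reward from a task-specific function
    # Log-probability of each sampled token under the model.
    logits = model(sample).logits[:, :-1, :]
    logprobs = torch.log_softmax(logits, dim=-1)
    token_lp = logprobs.gather(-1, sample[:, 1:].unsqueeze(-1)).squeeze(-1)
    gen_lp = token_lp[:, prompt_len - 1:].sum()  # only the generated span
    loss = -reward * gen_lp  # policy-gradient objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    return reward
```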
Large neural networks can now generate jokes, but do they really "understand" humor?
Large-scale language models often learn behaviors that are misaligned with user expectations.
Large language models readily adapt to novel settings, even without task-specific training data.
We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents.
Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet.
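A minimal sketch of this masked-snippet objective over precomputed embeddings; the tensor names and shapes are illustrative assumptions, not the model's actual code:

```python
import torch
import torch.nn.functional as F

def masked_snippet_loss(mask_embs: torch.Tensor, candidate_embs: torch.Tensor,
                        targets: torch.Tensor) -> torch.Tensor:
    """Contrastive loss for choosing the correct masked-out snippet.

    mask_embs:      (batch, dim) contextual embeddings at MASK positions.
    candidate_embs: (num_candidates, dim) embeddings of candidate text/audio snippets.
    targets:        (batch,) index of the correct snippet for each MASK.
    """
    # Cosine-normalize, then score every candidate against every MASK position.
    mask_embs = F.normalize(mask_embs, dim=-1)
    candidate_embs = F.normalize(candidate_embs, dim=-1)
    logits = mask_embs @ candidate_embs.T  # (batch, num_candidates)
    return F.cross_entropy(logits, targets)
```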
We create a pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop.
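A minimal sketch of such a generate-then-filter loop; `generate_candidates` and `filter_model` are hypothetical placeholders for the GPT-3 sampler and the supervised acceptability classifier:

```python
def filtered_generations(generate_candidates, filter_model, prompt: str,
                         n: int = 10, threshold: float = 0.5) -> list[str]:
    """Hypothetical generate-then-filter pipeline: an LLM proposes candidates,
    and a classifier trained on human binary acceptability judgments keeps
    only those it scores above the threshold."""
    candidates = generate_candidates(prompt, n=n)  # e.g., GPT-3 samples
    return [c for c in candidates if filter_model(c) >= threshold]
```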
In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1.
We apply this to the ATOMIC resource, and share our new symbolic knowledge graph and commonsense models.
As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future.
Image captioning has conventionally relied on reference-based automatic evaluations, where machine captions are compared against captions written by humans.
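For concreteness, a minimal sketch of one simple reference-based score (best token-overlap F1 against the human references); this is an illustrative baseline, not a metric proposed here:

```python
def reference_f1(candidate: str, references: list[str]) -> float:
    """Best token-overlap F1 between a machine caption and human references."""
    cand = set(candidate.lower().split())
    best = 0.0
    for ref in references:
        ref_toks = set(ref.lower().split())
        overlap = len(cand & ref_toks)
        if overlap == 0:
            continue
        precision = overlap / len(cand)
        recall = overlap / len(ref_toks)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```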
Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations.
Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering.
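One common way to model such interactions is cross-attention, where tokens of one modality attend over features of the other; a minimal PyTorch sketch with illustrative dimensions, not any specific model's architecture:

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """Text tokens attend over image region features (illustrative only)."""

    def __init__(self, dim: int = 256, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, image: torch.Tensor) -> torch.Tensor:
        # text:  (batch, num_tokens, dim)   queries
        # image: (batch, num_regions, dim)  keys/values
        fused, _ = self.attn(query=text, key=image, value=image)
        return fused
```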
Pretraining from unlabelled web videos has quickly become the de facto means of achieving high performance on many video understanding tasks.
Instructional videos draw high traffic on video-sharing platforms, and prior work suggests that providing time-stamped subtask annotations (e.g., "heat the oil in the pan") improves user experiences.
Images and text co-occur constantly on the web, but explicit links between images and sentences (or other intra-document textual units) are often not present.
Controversial posts are those that split the preferences of a community, receiving both significant positive and significant negative feedback.
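One simple way to operationalize this, for illustration, is a Reddit-style controversy score that is high only when feedback is both heavy and evenly split (an assumption for illustration, not necessarily the definition used here):

```python
def controversy(upvotes: int, downvotes: int) -> float:
    """Illustrative Reddit-style controversy score: high only when feedback
    has both large magnitude and an even up/down balance."""
    if upvotes <= 0 or downvotes <= 0:
        return 0.0
    magnitude = upvotes + downvotes
    balance = min(upvotes, downvotes) / max(upvotes, downvotes)
    return magnitude ** balance
```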
The content of today's social media is becoming increasingly rich, mixing text, images, videos, and audio.