Search Results for author: Jack Hessel

Found 37 papers, 25 papers with code

Tailoring Self-Rationalizers with Multi-Reward Distillation

1 code implementation · 6 Nov 2023 · Sahana Ramnath, Brihi Joshi, Skyler Hallinan, Ximing Lu, Liunian Harold Li, Aaron Chan, Jack Hessel, Yejin Choi, Xiang Ren

Results on five difficult question-answering datasets (StrategyQA, QuaRel, OpenBookQA, NumerSense, and QASC) show that not only does MaRio improve task accuracy, but it also improves the self-rationalization quality of small LMs across the aforementioned axes better than a supervised fine-tuning (SFT) baseline.

Question Answering · StrategyQA

What's "up" with vision-language models? Investigating their struggle with spatial reasoning

no code implementations · 30 Oct 2023 · Amita Kamath, Jack Hessel, Kai-Wei Chang

Recent vision-language (VL) models are powerful, but can they reliably distinguish "right" from "left"?

Personalized Soups: Personalized Large Language Model Alignment via Post-hoc Parameter Merging

1 code implementation · 17 Oct 2023 · Joel Jang, Seungone Kim, Bill Yuchen Lin, Yizhong Wang, Jack Hessel, Luke Zettlemoyer, Hannaneh Hajishirzi, Yejin Choi, Prithviraj Ammanabrolu

In this work, we study the Reinforcement Learning from Personalized Human Feedback (RLPHF) problem, wherein LLMs are aligned to multiple (sometimes conflicting) preferences by modeling alignment as a Multi-Objective Reinforcement Learning (MORL) problem.

Language Modelling · Large Language Model · +2

Reading Books is Great, But Not if You Are Driving! Visually Grounded Reasoning about Defeasible Commonsense Norms

1 code implementation · 16 Oct 2023 · Seungju Han, Junhyeok Kim, Jack Hessel, Liwei Jiang, Jiwan Chung, Yejin Son, Yejin Choi, Youngjae Yu

NORMLENS consists of 10K human judgments accompanied by free-form explanations covering 2K multimodal situations, and serves as a probe to address two questions: (1) to what extent can models align with average human judgment?

VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use

1 code implementation · 12 Aug 2023 · Yonatan Bitton, Hritik Bansal, Jack Hessel, Rulin Shao, Wanrong Zhu, Anas Awadalla, Josh Gardner, Rohan Taori, Ludwig Schmidt

These descriptions enable 1) collecting human-verified reference outputs for each instance; and 2) automatic evaluation of candidate multimodal generations using a text-only LLM, aligning with human judgment.

Instruction Following

FunQA: Towards Surprising Video Comprehension

1 code implementation · 26 Jun 2023 · Binzhu Xie, Sicheng Zhang, Zitang Zhou, Bo Li, Yuanhan Zhang, Jack Hessel, Jingkang Yang, Ziwei Liu

Surprising videos, e.g., funny clips, creative performances, or visual illusions, attract significant attention.

Question Answering · Text Generation · +3

How Far Can Camels Go? Exploring the State of Instruction Tuning on Open Resources

1 code implementation · NeurIPS 2023 · Yizhong Wang, Hamish Ivison, Pradeep Dasigi, Jack Hessel, Tushar Khot, Khyathi Raghavi Chandu, David Wadden, Kelsey MacMillan, Noah A. Smith, Iz Beltagy, Hannaneh Hajishirzi

Our evaluations show that the best model in any given evaluation reaches on average 87% of ChatGPT performance, and 73% of GPT-4 performance, suggesting that further investment in building better base models and instruction-tuning data is required to close the gap.

Instruction Following

Text encoders bottleneck compositionality in contrastive vision-language models

1 code implementation · 24 May 2023 · Amita Kamath, Jack Hessel, Kai-Wei Chang

We first curate CompPrompts, a set of increasingly compositional image captions that VL models should be able to capture (e.g., from single object, to object+property, to multiple interacting objects).

Image Captioning

Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning

1 code implementation · CVPR 2023 · Youngjae Yu, Jiwan Chung, Heeseung Yun, Jack Hessel, Jae Sung Park, Ximing Lu, Rowan Zellers, Prithviraj Ammanabrolu, Ronan Le Bras, Gunhee Kim, Yejin Choi

Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), and larger models like GPT-3 manifest broad commonsense reasoning capacity.

Language Modelling · reinforcement-learning · +2

The Abduction of Sherlock Holmes: A Dataset for Visual Abductive Reasoning

no code implementations · 10 Feb 2022 · Jack Hessel, Jena D. Hwang, Jae Sung Park, Rowan Zellers, Chandra Bhagavatula, Anna Rohrbach, Kate Saenko, Yejin Choi

We present Sherlock, an annotated corpus of 103K images for testing machine capacity for abductive reasoning beyond literal image contents.

Visual Abductive Reasoning · Visual Reasoning

Reframing Human-AI Collaboration for Generating Free-Text Explanations

1 code implementation · NAACL 2022 · Sarah Wiegreffe, Jack Hessel, Swabha Swayamdipta, Mark Riedl, Yejin Choi

We create a pipeline that combines GPT-3 with a supervised filter that incorporates binary acceptability judgments from humans in the loop.

Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer

1 code implementation · NAACL 2022 · Yanpeng Zhao, Jack Hessel, Youngjae Yu, Ximing Lu, Rowan Zellers, Yejin Choi

In a difficult zero-shot setting with no paired audio-text data, our model demonstrates state-of-the-art zero-shot performance on the ESC50 and US8K audio classification tasks, and even surpasses the supervised state of the art for Clotho caption retrieval (with audio queries) by 2.2% R@1.

Audio Classification · Audio Tagging · +3

MERLOT: Multimodal Neural Script Knowledge Models

1 code implementation · NeurIPS 2021 · Rowan Zellers, Ximing Lu, Jack Hessel, Youngjae Yu, Jae Sung Park, Jize Cao, Ali Farhadi, Yejin Choi

As humans, we understand events in the visual world contextually, performing multimodal reasoning across time to make inferences about the past, present, and future.

Visual Commonsense Reasoning

Domain-Specific Lexical Grounding in Noisy Visual-Textual Documents

1 code implementation · EMNLP 2020 · Gregory Yauney, Jack Hessel, David Mimno

Images can give us insights into the contextual meanings of words, but current image-text grounding approaches require detailed annotations.

Clustering · object-detection · +1

A Case Study on Combining ASR and Visual Features for Generating Instructional Video Captions

no code implementations · CoNLL 2019 · Jack Hessel, Bo Pang, Zhenhai Zhu, Radu Soricut

Instructional videos get high traffic on video sharing platforms, and prior work suggests that providing time-stamped, subtask annotations (e.g., "heat the oil in the pan") improves user experiences.

Automatic Speech Recognition · Automatic Speech Recognition (ASR) · +1

Unsupervised Discovery of Multimodal Links in Multi-image, Multi-sentence Documents

2 code implementations · IJCNLP 2019 · Jack Hessel, Lillian Lee, David Mimno

Images and text co-occur constantly on the web, but explicit links between images and sentences (or other intra-document textual units) are often not present.


Something's Brewing! Early Prediction of Controversy-causing Posts from Discussion Features

no code implementations · NAACL 2019 · Jack Hessel, Lillian Lee

Controversial posts are those that split the preferences of a community, receiving both significant positive and significant negative feedback.

Cats and Captions vs. Creators and the Clock: Comparing Multimodal Content to Context in Predicting Relative Popularity

1 code implementation · 6 Mar 2017 · Jack Hessel, Lillian Lee, David Mimno

The content of today's social media is becoming richer and richer, increasingly mixing text, images, videos, and audio.

Image Representations and New Domains in Neural Image Captioning

no code implementations · WS 2015 · Jack Hessel, Nicolas Savva, Michael J. Wilber

We examine the possibility that recent promising results in automatic caption generation are due primarily to language models.

Image Captioning
