Search Results for author: Yilun Zhao

Found 28 papers, 18 papers with code

FinMath: Injecting a Tree-structured Solver for Question Answering over Financial Reports

no code implementations • LREC 2022 • Chenying Li, Wenbo Ye, Yilun Zhao

This paper proposes a new framework named FinMath, which improves the model’s numerical reasoning capacity by injecting a tree-structured neural model to perform multi-step numerical reasoning.

Question Answering

Paper
Add Code

Evaluating LLMs at Detecting Errors in LLM Responses

1 code implementation • 4 Apr 2024 • Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, Xiaoxin Lu, Nan Zhang, Yusen Zhang, Ranran Haoran Zhang, Sujeeth Reddy Vummanthala, Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang

This work introduces ReaLMistake, the first error detection benchmark consisting of objective, realistic, and diverse errors made by LLMs.

Instruction Following

Paper
Code

MIMIR: A Streamlined Platform for Personalized Agent Tuning in Domain Expertise

no code implementations • 3 Apr 2024 • Chunyuan Deng, Xiangru Tang, Yilun Zhao, Hanming Wang, Haoran Wang, Wangchunshu Zhou, Arman Cohan, Mark Gerstein

Recently, large language models (LLMs) have evolved into interactive agents, proficient in planning, tool use, and task execution across a wide variety of tasks.

Paper
Add Code

Prioritizing Safeguarding Over Autonomy: Risks of LLM Agents for Science

no code implementations • 6 Feb 2024 • Xiangru Tang, Qiao Jin, Kunlun Zhu, Tongxin Yuan, Yichi Zhang, Wangchunshu Zhou, Meng Qu, Yilun Zhao, Jian Tang, Zhuosheng Zhang, Arman Cohan, Zhiyong Lu, Mark Gerstein

Intelligent agents powered by large language models (LLMs) have demonstrated substantial promise in autonomously conducting experiments and facilitating scientific discoveries across various disciplines.

Paper
Add Code

Uncertainty of Thoughts: Uncertainty-Aware Planning Enhances Information Seeking in Large Language Models

1 code implementation • 5 Feb 2024 • Zhiyuan Hu, Chumin Liu, Xidong Feng, Yilun Zhao, See-Kiong Ng, Anh Tuan Luu, Junxian He, Pang Wei Koh, Bryan Hooi

In the face of uncertainty, the ability to seek information is of fundamental importance.

Medical Diagnosis

Paper
Code

ML-Bench: Evaluating Large Language Models for Code Generation in Repository-Level Machine Learning Tasks

1 code implementation • 16 Nov 2023 • Yuliang Liu, Xiangru Tang, Zefan Cai, Junjie Lu, Yichi Zhang, Yanjun Shao, Zexuan Deng, Helan Hu, Kaikai An, Ruijun Huang, Shuzheng Si, Sheng Chen, Haozhe Zhao, Liang Chen, Yan Wang, Tianyu Liu, Zhiwei Jiang, Baobao Chang, Yujia Qin, Wangchunshu Zhou, Yilun Zhao, Arman Cohan, Mark Gerstein

While Large Language Models (LLMs) have demonstrated proficiency in code generation benchmarks, translating these results into practical development scenarios - where leveraging existing repository-level libraries is the norm - remains challenging.

Code Generation Navigate

Paper
Code

MedAgents: Large Language Models as Collaborators for Zero-shot Medical Reasoning

1 code implementation • 16 Nov 2023 • Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, Mark Gerstein

Large language models (LLMs), despite their remarkable progress across various general domains, encounter significant barriers in medicine and healthcare.

152

Paper
Code

DocMath-Eval: Evaluating Numerical Reasoning Capabilities of LLMs in Understanding Long Documents with Tabular Data

no code implementations • 16 Nov 2023 • Yilun Zhao, Yitao Long, Hongjun Liu, Linyong Nan, Lyuhao Chen, Ryo Kamoi, Yixin Liu, Xiangru Tang, Rui Zhang, Arman Cohan

This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning and problem-solving capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables.

Math

Paper
Add Code

On Evaluating the Integration of Reasoning and Action in LLM Agents with Database Question Answering

no code implementations • 16 Nov 2023 • Linyong Nan, Ellen Zhang, Weijin Zou, Yilun Zhao, Wenfei Zhou, Arman Cohan

A key discovery is the identification of two primary bottlenecks hindering effective interaction: the capacity for planning and the ability to generate multiple SQL queries.

Question Answering Retrieval

Paper
Add Code

KnowledgeMath: Knowledge-Intensive Math Word Problem Solving in Finance Domains

1 code implementation • 16 Nov 2023 • Yilun Zhao, Hongjun Liu, Yitao Long, Rui Zhang, Chen Zhao, Arman Cohan

We introduce KnowledgeMath, a novel benchmark designed to evaluate LLMs' capabilities in applying financial knowledge to solve complex math word problems.

Math Math Word Problem Solving +1

Paper
Code

Investigating Data Contamination in Modern Benchmarks for Large Language Models

no code implementations • 16 Nov 2023 • Chunyuan Deng, Yilun Zhao, Xiangru Tang, Mark Gerstein, Arman Cohan

Recent observations have underscored a disparity between the inflated benchmark scores and the actual performance of LLMs, raising concerns about potential contamination of evaluation benchmarks.

Common Sense Reasoning Multiple-choice +1

Paper
Add Code

Benchmarking Generation and Evaluation Capabilities of Large Language Models for Instruction Controllable Summarization

1 code implementation • 15 Nov 2023 • Yixin Liu, Alexander R. Fabbri, Jiawen Chen, Yilun Zhao, Simeng Han, Shafiq Joty, PengFei Liu, Dragomir Radev, Chien-Sheng Wu, Arman Cohan

Our study reveals that instruction controllable text summarization remains a challenging task for LLMs, since (1) all LLMs evaluated still make factual and other types of errors in their summaries; (2) all LLM-based evaluation methods cannot achieve a strong alignment with human annotators when judging the quality of candidate summaries; (3) different LLMs show large performance gaps in summary generation and evaluation.

Benchmarking Text Summarization

Paper
Code

L2CEval: Evaluating Language-to-Code Generation Capabilities of Large Language Models

no code implementations • 29 Sep 2023 • Ansong Ni, Pengcheng Yin, Yilun Zhao, Martin Riddell, Troy Feng, Rui Shen, Stephen Yin, Ye Liu, Semih Yavuz, Caiming Xiong, Shafiq Joty, Yingbo Zhou, Dragomir Radev, Arman Cohan

Recently, large language models (LLMs), especially those that are pretrained on code, have demonstrated strong capabilities in generating programs from natural language inputs in a few-shot or even zero-shot manner.

Code Generation Math +1

Paper
Add Code

Struc-Bench: Are Large Language Models Really Good at Generating Complex Structured Data?

1 code implementation • 16 Sep 2023 • Xiangru Tang, Yiming Zong, Jason Phang, Yilun Zhao, Wangchunshu Zhou, Arman Cohan, Mark Gerstein

Despite the remarkable capabilities of Large Language Models (LLMs) like GPT-4, producing complex, structured tabular data remains challenging.

Hallucination

Paper
Code

ODSum: New Benchmarks for Open Domain Multi-Document Summarization

1 code implementation • 16 Sep 2023 • Yijie Zhou, Kejian Shi, Wencai Zhang, Yixin Liu, Yilun Zhao, Arman Cohan

Open-domain Multi-Document Summarization (ODMDS) is a critical tool for condensing vast arrays of documents into coherent, concise summaries.

Document Summarization Multi-Document Summarization +1

Paper
Code

RobuT: A Systematic Study of Table QA Robustness Against Human-Annotated Adversarial Perturbations

1 code implementation • 25 Jun 2023 • Yilun Zhao, Chen Zhao, Linyong Nan, Zhenting Qi, Wenlin Zhang, Xiangru Tang, Boyu Mi, Dragomir Radev

Despite significant progress having been made in question answering on tabular data (Table QA), it's unclear whether, and to what extent existing Table QA models are robust to task-specific perturbations, e. g., replacing key question entities or shuffling table columns.

Few-Shot Learning Question Answering

Paper
Code

Investigating Table-to-Text Generation Capabilities of LLMs in Real-World Information Seeking Scenarios

2 code implementations • 24 May 2023 • Yilun Zhao, Haowei Zhang, Shengyun Si, Linyong Nan, Xiangru Tang, Arman Cohan

These include the LogicNLG and our newly-constructed LoTNLG datasets for data insight generation, along with the FeTaQA and our newly-constructed F2WTQ datasets for query-based generation.

Table-to-Text Generation

Paper
Code

QTSumm: Query-Focused Summarization over Tabular Data

2 code implementations • 23 May 2023 • Yilun Zhao, Zhenting Qi, Linyong Nan, Boyu Mi, Yixin Liu, Weijin Zou, Simeng Han, Ruizhe Chen, Xiangru Tang, Yumo Xu, Dragomir Radev, Arman Cohan

Motivated by this, we define a new query-focused table summarization task, where text generation models have to perform human-like reasoning and analysis over the given table to generate a tailored summary.

Query-focused Summarization Table-to-Text Generation

Paper
Code

Enhancing Few-shot Text-to-SQL Capabilities of Large Language Models: A Study on Prompt Design Strategies

no code implementations • 21 May 2023 • Linyong Nan, Yilun Zhao, Weijin Zou, Narutatsu Ri, Jaesung Tae, Ellen Zhang, Arman Cohan, Dragomir Radev

In-context learning (ICL) has emerged as a new approach to various natural language processing tasks, utilizing large language models (LLMs) to make predictions based on context that has been supplemented with a few examples or task-specific instructions.

In-Context Learning Question Answering +1

Paper
Add Code

Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation

1 code implementation • 7 Mar 2023 • Yixin Liu, Alexander R. Fabbri, Yilun Zhao, PengFei Liu, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir Radev

Interpretability and efficiency are two important considerations for the adoption of neural automatic metrics.

Paper
Code

LoFT: Enhancing Faithfulness and Diversity for Table-to-Text Generation via Logic Form Control

1 code implementation • 6 Feb 2023 • Yilun Zhao, Zhenting Qi, Linyong Nan, Lorenzo Jaime Yu Flores, Dragomir Radev

Logical Table-to-Text (LT2T) generation is tasked with generating logically faithful sentences from tables.

Table-to-Text Generation

Paper
Code

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

2 code implementations • 15 Dec 2022 • Yixin Liu, Alexander R. Fabbri, PengFei Liu, Yilun Zhao, Linyong Nan, Ruilin Han, Simeng Han, Shafiq Joty, Chien-Sheng Wu, Caiming Xiong, Dragomir Radev

Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests.

Paper
Code

ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples

1 code implementation • 22 Oct 2022 • Yilun Zhao, Linyong Nan, Zhenting Qi, Rui Zhang, Dragomir Radev

Reasoning over tabular data requires both table structure understanding and a broad set of table reasoning skills.

Ranked #3 on Semantic Parsing on WikiSQL (Denotation accuracy (test) metric)

Fact Verification Question Answering +3

Paper
Code

FOLIO: Natural Language Reasoning with First-Order Logic

1 code implementation • 2 Sep 2022 • Simeng Han, Hailey Schoelkopf, Yilun Zhao, Zhenting Qi, Martin Riddell, Luke Benson, Lucy Sun, Ekaterina Zubova, Yujie Qiao, Matthew Burtell, David Peng, Jonathan Fan, Yixin Liu, Brian Wong, Malcolm Sailor, Ansong Ni, Linyong Nan, Jungo Kasai, Tao Yu, Rui Zhang, Shafiq Joty, Alexander R. Fabbri, Wojciech Kryscinski, Xi Victoria Lin, Caiming Xiong, Dragomir Radev

We present FOLIO, a human-annotated, open-domain, and logically complex and diverse dataset for reasoning in natural language (NL), equipped with first order logic (FOL) annotations.

Language Modelling Large Language Model +1

Paper
Code

MultiHiertt: Numerical Reasoning over Multi Hierarchical Tabular and Textual Data

1 code implementation • ACL 2022 • Yilun Zhao, Yunxiang Li, Chenying Li, Rui Zhang

Numerical reasoning over hybrid data containing both textual and tabular content (e. g., financial reports) has recently attracted much attention in the NLP community.

Question Answering

Paper
Code

R2D2: Robust Data-to-Text with Replacement Detection

1 code implementation • 25 May 2022 • Linyong Nan, Lorenzo Jaime Yu Flores, Yilun Zhao, Yixin Liu, Luke Benson, Weijin Zou, Dragomir Radev

Unfaithful text generation is a common problem for text generation systems.

Data-to-Text Generation Entity Retrieval +2

Paper
Code

Apparel-invariant Feature Learning for Apparel-changed Person Re-identification

no code implementations • 14 Aug 2020 • Zhengxu Yu, Yilun Zhao, Bin Hong, Zhongming Jin, Jianqiang Huang, Deng Cai, Xiaofei He, Xian-Sheng Hua

Therefore, it is critical to learn an apparel-invariant person representation under cases like cloth changing or several persons wearing similar clothes.

Person Re-Identification Representation Learning

Paper
Add Code

MusiCoder: A Universal Music-Acoustic Encoder Based on Transformers

no code implementations • 3 Aug 2020 • Yilun Zhao, Jia Guo

The performance of MusiCoder is evaluated in two downstream music annotation tasks.

Genre classification Information Retrieval +5

Paper
Add Code

Cannot find the paper you are looking for? You can Submit a new open access paper.