Long-Context Understanding

38 papers with code • 3 benchmarks • 1 dataset


Most implemented papers

GPT-4 Technical Report

openai/evals Preprint 2023

We report the development of GPT-4, a large-scale, multimodal model which can accept image and text inputs and produce text outputs.

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

lm-sys/fastchat NeurIPS 2023

Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences.

GLM-130B: An Open Bilingual Pre-trained Model

thudm/glm-130b 5 Oct 2022

We introduce GLM-130B, a bilingual (English and Chinese) pre-trained language model with 130 billion parameters.

RULER: What's the Real Context Size of Your Long-Context Language Models?

hsiehjackson/ruler 9 Apr 2024

Despite achieving nearly perfect accuracy in the vanilla NIAH test, almost all models exhibit large performance drops as the context length increases.
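The "vanilla NIAH" (needle-in-a-haystack) test mentioned here hides a single factoid in a long span of distractor text and asks the model to retrieve it. A minimal sketch of that prompt-construction step is below; the function name, filler text, and question wording are illustrative choices, not taken from the RULER implementation.

```python
def build_niah_prompt(needle: str, filler: str, n_filler: int, depth: float) -> str:
    """Build a needle-in-a-haystack prompt: repeat `filler` n_filler times,
    insert `needle` at relative position `depth` (0.0 = start, 1.0 = end),
    and append a retrieval question."""
    sentences = [filler] * n_filler
    pos = int(depth * len(sentences))
    sentences.insert(pos, needle)
    context = " ".join(sentences)
    question = "What is the magic number mentioned in the text above?"
    return f"{context}\n\n{question}"

# Example: bury the needle halfway through ~200 filler sentences.
prompt = build_niah_prompt(
    needle="The magic number is 7412.",
    filler="The sky was clear and the grass was green.",
    n_filler=200,
    depth=0.5,
)
```

A full harness would sweep `n_filler` (context length) and `depth`, send each prompt to the model under test, and score whether the answer contains the needle's factoid; RULER's finding is that accuracy on this simple setup does not predict performance on its harder long-context tasks.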

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

salesforce/lavis NeurIPS 2023

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence.

CogVLM: Visual Expert for Pretrained Language Models

thudm/cogvlm 6 Nov 2023

We introduce CogVLM, a powerful open-source visual language foundation model.

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

thudm/longbench 28 Aug 2023

In this paper, we introduce LongBench, the first bilingual, multi-task benchmark for long context understanding, enabling a more rigorous evaluation of long-context capabilities.

InternLM2 Technical Report

internlm/internlm 26 Mar 2024

The evolution of Large Language Models (LLMs) like ChatGPT and GPT-4 has sparked discussions on the advent of Artificial General Intelligence (AGI).

FABLES: Evaluating faithfulness and content selection in book-length summarization

mungg/fables 1 Apr 2024

While LLM-based auto-raters have proven reliable for factuality and coherence in other settings, we implement several LLM raters of faithfulness and find that none correlates strongly with human annotations, especially with regard to detecting unfaithful claims.

Gated Delta Networks: Improving Mamba2 with Delta Rule

NVlabs/GatedDeltaNet 9 Dec 2024

Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited.
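The delta rule referenced here updates a recurrent memory matrix by first erasing the value currently bound to a key, then writing the new one; the "gated" variant additionally decays the whole memory each step. A minimal NumPy sketch of one plausible form of this update is below, assuming a state of shape `(d_v, d_k)` and scalar gate `alpha` and write-strength `beta`; this is an illustration of the general mechanism, not the paper's implementation.

```python
import numpy as np

def gated_delta_step(S: np.ndarray, k: np.ndarray, v: np.ndarray,
                     alpha: float, beta: float) -> np.ndarray:
    """One recurrent step of a gated delta-rule memory update.

    S     : (d_v, d_k) memory matrix
    k, v  : key (d_k,) and value (d_v,) for this timestep
    alpha : decay gate in [0, 1] applied to the existing memory
    beta  : write strength in [0, 1] for the delta-rule update
    """
    # Delta rule: subtract the value currently retrieved by k, then write v,
    # with the retained memory scaled by the gate alpha.
    return alpha * (S - beta * np.outer(S @ k, k)) + beta * np.outer(v, k)

# Writing a (key, value) pair into an empty memory, then reading with the
# same unit-norm key, recovers the value.
S = np.zeros((2, 3))
k = np.array([1.0, 0.0, 0.0])
v = np.array([2.0, 3.0])
S = gated_delta_step(S, k, v, alpha=1.0, beta=1.0)
```

With `alpha < 1` old associations fade over time, which is the Mamba2-style decay; with `alpha = 1` the update reduces to the plain delta rule.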