Winogrande
12 papers with code • 0 benchmarks • 0 datasets
Most implemented papers
WinoGrande: An Adversarial Winograd Schema Challenge at Scale
The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations.
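The bias-reduction step (2) can be illustrated with a minimal sketch, assuming each instance already has a precomputed embedding and a label. This is not the authors' implementation: the nearest-centroid classifier is a simple stand-in for whatever classifier the paper uses, and the names `aflite_style_filter`, `train_size`, `rounds`, `cutoff`, and `threshold` are hypothetical. The idea shown is to score how often each instance is predicted correctly from its embedding alone when held out, then drop the most predictable instances and repeat.

```python
import random


def centroid_classifier(train):
    """Fit a nearest-centroid classifier (a stand-in, assumed for illustration)."""
    sums, counts = {}, {}
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    centroids = {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

    def predict(vec):
        # Return the label whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda lab: sum((a - b) ** 2 for a, b in zip(vec, centroids[lab])),
        )

    return predict


def aflite_style_filter(data, train_size, rounds, cutoff, threshold, seed=0):
    """Return indices of instances kept after adversarial-style filtering.

    data: list of (embedding, label) pairs; all other parameters are
    hypothetical knobs chosen for this sketch.
    """
    rng = random.Random(seed)
    kept = list(range(len(data)))
    while len(kept) > train_size:
        correct = {i: 0 for i in kept}
        seen = {i: 0 for i in kept}
        for _ in range(rounds):
            shuffled = kept[:]
            rng.shuffle(shuffled)
            train_idx, held_out = shuffled[:train_size], shuffled[train_size:]
            predict = centroid_classifier([data[i] for i in train_idx])
            for i in held_out:
                seen[i] += 1
                correct[i] += predict(data[i][0]) == data[i][1]
        # Predictability score: fraction of held-out rounds predicted correctly.
        scores = {i: correct[i] / seen[i] for i in kept if seen[i]}
        flagged = sorted(
            (i for i, s in scores.items() if s >= threshold),
            key=lambda i: -scores[i],
        )[:cutoff]
        if not flagged:
            break
        removed = set(flagged)
        kept = [i for i in kept if i not in removed]
        if len(flagged) < cutoff:
            break  # too few predictable instances left to justify another pass
    return kept
```

Instances whose labels are easy to guess from embedding artifacts tend to receive high scores and are removed first; what remains is the subset that a simple classifier cannot solve from the embedding alone.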
Scaling Language Models: Methods, Analysis & Insights from Training Gopher
Language modelling provides a step towards intelligent communication systems by harnessing large repositories of written human knowledge to better predict and understand the world.
ST-MoE: Designing Stable and Transferable Sparse Expert Models
But advancing the state-of-the-art across a broad set of natural language tasks has been hindered by training instabilities and uncertain quality during fine-tuning.
Generative Data Augmentation for Commonsense Reasoning
Recent advances in commonsense reasoning depend on large-scale human-annotated training data to achieve peak performance.
UNICORN on RAINBOW: A Universal Commonsense Reasoning Model on a New Multitask Benchmark
First, we propose a new multitask benchmark, RAINBOW, to promote research on commonsense models that generalize well over multiple tasks and datasets.
Few-Shot Out-of-Domain Transfer Learning of Natural Language Explanations in a Label-Abundant Setup
A potential solution is the few-shot out-of-domain transfer of NLEs from a parent task with many NLEs to a child task.
Are Hard Examples also Harder to Explain? A Study with Human and Model-Generated Explanations
We observe that (1) GPT-3 explanations are as grammatical as human explanations regardless of the hardness of the test samples, (2) for easy examples, GPT-3 generates highly supportive explanations but human explanations are more generalizable, and (3) for hard examples, human explanations are significantly better than GPT-3 explanations both in terms of label-supportiveness and generalizability judgements.
LISA: Layerwise Importance Sampling for Memory-Efficient Large Language Model Fine-Tuning
Attempting to complement this deficiency, we investigate the layerwise properties of LoRA on fine-tuning tasks and observe an unexpected but consistent skewness of weight norms across different layers.
LayerSkip: Enabling Early Exit Inference and Self-Speculative Decoding
We present LayerSkip, an end-to-end solution to speed up inference of large language models (LLMs).
metabench -- A Sparse Benchmark to Measure General Ability in Large Language Models
Large Language Models (LLMs) vary in their abilities on a range of tasks.