The RAFT benchmark (Real-world Annotated Few-shot Tasks) focuses on naturally occurring tasks and uses an evaluation setup that mirrors deployment.
RAFT is a few-shot classification benchmark that tests language models.
Description from: https://raft.elicit.org/