VoxCeleb1 is an audio dataset containing over 100,000 utterances for 1,251 celebrities, extracted from videos uploaded to YouTube.
680 PAPERS • 10 BENCHMARKS
BIG-Bench Hard (BBH) is a subset of the BIG-Bench, a diverse evaluation suite for language models. BBH focuses on a suite of 23 challenging tasks from BIG-Bench that were found to be beyond the capabilities of current language models. These tasks are ones where prior language model evaluations did not outperform the average human-rater.
351 PAPERS • 4 BENCHMARKS
The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. BIG-bench includes more than 200 tasks.
349 PAPERS • 134 BENCHMARKS
GPQA (Graduate-Level Google-Proof Q&A Benchmark) is a challenging dataset designed to evaluate the capabilities of Large Language Models (LLMs) and scalable oversight mechanisms.
264 PAPERS • 1 BENCHMARK
Dataset Summary
110 PAPERS • 1 BENCHMARK
The VGG Face dataset is a face identity recognition dataset consisting of 2,622 identities. It contains over 2.6 million images.
94 PAPERS • 1 BENCHMARK
ScanNet++ is a large scale dataset with 450+ 3D indoor scenes containing sub-millimeter resolution laser scans, registered 33-megapixel DSLR images, and commodity RGB-D streams from iPhone. The 3D reconstructions are annotated with long-tail and label-ambiguous semantics to benchmark semantic understanding methods, while the coupled DSLR and iPhone captures enable benchmarking of novel view synthesis methods in high-quality and commodity settings.
25 PAPERS • 6 BENCHMARKS
100 tasks from the LIBERO-100 suite. Note that the datasets are split under the folder names LIBERO-90 and LIBERO-10.
22 PAPERS • 1 BENCHMARK
WISE is the first benchmark specifically designed for World Knowledge-Informed Semantic Evaluation. WISE moves beyond simple word-pixel mapping by challenging models with 1,000 meticulously crafted prompts across 25 sub-domains spanning cultural common sense, spatio-temporal understanding, and natural science.
22 PAPERS • 2 BENCHMARKS
The CoIR (Code Information Retrieval) benchmark is designed to evaluate code retrieval capabilities. CoIR includes 10 curated code datasets covering 8 retrieval tasks across 7 domains; in total, it encompasses two million documents. It also provides a simple Python framework, installable via pip, and shares the same data schema as benchmarks such as MTEB and BEIR for easy cross-benchmark evaluation.
18 PAPERS • 7 BENCHMARKS
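The shared data schema mentioned above is the one published for BEIR: a JSONL corpus, a JSONL query file, and a TSV qrels file linking the two. The sketch below builds a minimal example of that layout; the field names follow BEIR's documented format and are assumed to carry over to CoIR, and the IDs and texts are made up for illustration.

```python
import json

# Corpus records: one JSON object per line with "_id", "title", "text".
corpus = [
    {"_id": "doc1", "title": "", "text": "def add(a, b):\n    return a + b"},
]
# Query records: one JSON object per line with "_id", "text".
queries = [
    {"_id": "q1", "text": "function that sums two numbers"},
]
# Qrels: tab-separated (query-id, corpus-id, relevance score) triples.
qrels = [("q1", "doc1", 1)]

corpus_jsonl = "\n".join(json.dumps(r) for r in corpus)
queries_jsonl = "\n".join(json.dumps(r) for r in queries)
qrels_tsv = "\n".join(f"{q}\t{d}\t{s}" for q, d, s in qrels)
```

Because the schema matches BEIR/MTEB, a corpus prepared this way can in principle be evaluated across benchmarks without conversion.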
100 tasks from the LIBERO-100 suite. Note that the datasets are split under the folder names LIBERO-90 and LIBERO-10; the LIBERO-10 subset contains selected tasks that require long-horizon task completion.
12 PAPERS • 1 BENCHMARK
TUM monoVO is a dataset for evaluating the tracking accuracy of monocular Visual Odometry (VO) and SLAM methods. It contains 50 real-world sequences comprising over 100 minutes of video, recorded across different environments – ranging from narrow indoor corridors to wide outdoor scenes. All sequences contain mostly exploring camera motion, starting and ending at the same position: this allows tracking accuracy to be evaluated via the accumulated drift from start to end, without requiring ground truth for the full sequence. In contrast to existing datasets, all sequences are photometrically calibrated: the dataset creators provide the exposure times for each frame as reported by the sensor, the camera response function, and the lens attenuation factors (vignetting).
11 PAPERS • NO BENCHMARKS YET
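The start-equals-end design described above admits a very simple drift metric: the gap between the first and last estimated camera positions, normalized by path length since monocular VO has no absolute scale. The helper below is a hypothetical sketch of that idea, not the dataset's official evaluation tooling (which additionally handles alignment at the loop closure).

```python
import math

def accumulated_drift(trajectory):
    """Start-to-end drift of an estimated trajectory.

    Because each sequence starts and ends at the same physical
    location, the distance between the first and last estimated
    positions measures accumulated drift without full ground truth.
    Normalizing by total path length makes the number comparable
    across sequences, since monocular VO recovers no absolute scale.

    trajectory: list of (x, y, z) estimated camera positions.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    path_length = sum(dist(p, q) for p, q in zip(trajectory, trajectory[1:]))
    end_error = dist(trajectory[0], trajectory[-1])
    return end_error / path_length

# A toy square loop whose estimate fails to close by 0.1 units:
square = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0.1, 0)]
```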
The QMUL underGround Re-IDentification (GRID) dataset contains 250 pedestrian image pairs. Each pair contains two images of the same individual seen from different camera views. All images are captured from 8 disjoint camera views installed in a busy underground station. The figures beside show a snapshot of each of the camera views of the station and sample images in the dataset. The dataset is challenging due to variations in pose, colour, and lighting, as well as poor image quality caused by low spatial resolution.
10 PAPERS • 5 BENCHMARKS
Language Model (LM) agents for cybersecurity that are capable of autonomously identifying vulnerabilities and executing exploits have the potential to cause real-world impact. Policymakers, model providers, and researchers in the AI and cybersecurity communities are interested in quantifying the capabilities of such agents to help mitigate cyber risk and investigate opportunities for penetration testing. Toward that end, we introduce Cybench, a framework for specifying cybersecurity tasks and evaluating agents on those tasks. We include 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties. Each task includes its own description and starter files, and is initialized in an environment where an agent can execute commands and observe outputs. Since many tasks are beyond the capabilities of existing LM agents, we introduce subtasks for each task, which break down a task into intermediary steps.
9 PAPERS • 1 BENCHMARK
ImgEdit is a large-scale, high-quality image-editing dataset comprising 1.2 million carefully curated edit pairs, which contain both novel and complex single-turn edits, as well as challenging multi-turn tasks.
A multimodal agent benchmark on professional data science and engineering:
* 494 real-world tasks, ranging from data warehousing to orchestration;
* 20 professional enterprise-level applications (e.g., BigQuery, dbt, Airbyte, etc.);
* both command-line (CLI) and graphical user interfaces (GUI);
* an interactive executable computer environment;
* a document warehouse for agent retrieval.
7 PAPERS • NO BENCHMARKS YET
Dataset Description: The interaction of 72 kinase inhibitors with 442 kinases covering >80% of the human catalytic protein kinome.
6 PAPERS • 3 BENCHMARKS
The ImplicitQA dataset was introduced in the paper ImplicitQA: Going beyond frames towards Implicit Video Reasoning.
5 PAPERS • 1 BENCHMARK
The pioneering eyeblink detection dataset is characterized by three key features: (1) samples with multiple human instances; (2) unconstrained in-the-wild scenarios; (3) untrimmed videos. These attributes make the dataset more challenging and better aligned with real-world scenarios.
3 PAPERS • 2 BENCHMARKS
At RSNA 2017 there was a contest to correctly identify the age of a child from an X-ray of their hand.
2 PAPERS • 1 BENCHMARK
This paper constructs 7-digit product Supply-Use Tables (SUTs) and symmetric Input-Output Tables (IOTs) for the Indian economy using microdata from the Annual Survey of Industries (ASI) for the period 2016-2021. We outline the methodology for generating input flows and reconciling registered and unregistered sector data via NPCMS-NIC concordance. The transition from SUTs to IOTs is explained using the Industry Technology Assumption. We apply this framework to analyse the economic impact—specifically Domestic Value Added (DVA) and employment influenced by production and exports. A case study of India's mobile phone sector reveals significant output growth, import substitution, an increase in exports, a shift in DVA/FVA shares, notable employment growth, with a leaning towards contractual labour, and increased female participation. These tables are valuable for analysing sectoral interdependencies and industrial policy effectiveness in India.
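The SUT-to-IOT transition under the Industry Technology Assumption has a standard matrix form: the product-by-product coefficient matrix is A = B·D, where B = U ĝ⁻¹ holds use coefficients per unit of industry output and D = V′ q̂⁻¹ holds market shares (this is "Model B" in the Eurostat SUT/IOT manual). The toy 2-product, 2-industry example below illustrates the mechanics; the figures are invented for illustration and are not the ASI microdata used in the paper.

```python
def matmul(X, Y):
    """Plain-Python matrix product for small illustrative matrices."""
    return [[sum(x * y for x, y in zip(row, col))
             for col in zip(*Y)] for row in X]

# Supply (make) matrix V: rows = products, columns = industries.
V = [[90.0, 10.0],
     [ 5.0, 95.0]]
# Use matrix U: rows = products, columns = industries.
U = [[30.0, 20.0],
     [10.0, 40.0]]

g = [sum(col) for col in zip(*V)]   # industry outputs (column sums of V)
q = [sum(row) for row in V]         # product outputs (row sums of V)

# B = U ĝ⁻¹ : inputs of each product per unit of industry output.
B = [[U[p][i] / g[i] for i in range(2)] for p in range(2)]
# D = V' q̂⁻¹ : share of each industry in each product's total output.
D = [[V[p][i] / q[p] for p in range(2)] for i in range(2)]
# Industry Technology Assumption: product-by-product coefficients.
A = matmul(B, D)
```

Each column of D sums to one (the market shares of a product across industries exhaust its output), which is what guarantees that A allocates every industry's inputs fully across the products it makes.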
An adaptation of the MVTec Anomaly Detection dataset, presented in the paper "Domain-independent detection of known anomalies".
1 PAPER • 1 BENCHMARK
High-resolution early gastric cancer (EGC) detection and analysis. Patient data: the dataset includes images from patients diagnosed with gastric cancer, distinguishing between early gastric cancer (EGC) and non-pathogenic gastric conditions (NGC). The study utilized data from 341 patients, with 124 classified as EGC and 217 as NGC. Image types: high-resolution images obtained from endoscopy. Data volume: the dataset comprises 1,120 images for EGC detection and 2,150 images for NGC.
GitBugs is a comprehensive and up-to-date dataset comprising over 150,000 bug reports from nine actively maintained open-source projects, including Firefox, Cassandra, and VS Code. GitBugs aggregates data from GitHub, Bugzilla, and Jira issue trackers, offering standardized categorical fields for classification tasks and predefined train/test splits for duplicate bug detection. In addition, it includes exploratory analysis notebooks and detailed project-level statistics, such as duplicate rates and resolution times. GitBugs supports various software engineering research tasks, including duplicate detection, retrieval-augmented generation, resolution prediction, automated triaging, and temporal analysis. The openly licensed dataset provides a valuable cross-project resource for benchmarking and advancing automated bug report analysis. Access the data and code at this https URL.
HT Docking is a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. It is used to study surrogate model accuracy for protein-ligand docking.
The MathEquiv dataset accompanies EquivPruner. It is specifically designed for mathematical statement equivalence, serving as a versatile resource applicable to a variety of mathematical tasks and scenarios. It consists of almost 100k math sentence pairs, each with an equivalence result and reasoning steps generated by GPT-4o.
We developed Web-Bench as a benchmark for evaluating the performance of LLMs on real-world web projects.