The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical reports records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned to 7.6 codes, on average. Data includes vital signs, medications, laboratory measurements, observations and notes charted by care providers, fluid balance, procedure codes, diagnostic codes, imaging reports, hospital length of stay, survival data, and more.
617 PAPERS • 7 BENCHMARKS
The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is to do automated chest x-ray interpretation, featuring uncertainty labels and radiologist-labeled reference standard evaluation sets.
275 PAPERS • 1 BENCHMARK
The Digital Retinal Images for Vessel Extraction (DRIVE) dataset is a dataset for retinal vessel segmentation. It consists of a total of JPEG 40 color fundus images; including 7 abnormal pathology cases. The images were obtained from a diabetic retinopathy screening program in the Netherlands. The images were acquired using Canon CR5 non-mydriatic 3CCD camera with FOV equals to 45 degrees. Each image resolution is 584*565 pixels with eight bits per color channel (3 channels).
202 PAPERS • 2 BENCHMARKS
The fastMRI dataset includes two types of MRI scans: knee MRIs and the brain (neuro) MRIs, and containing training, validation, and masked test sets. The deidentified imaging dataset provided by NYU Langone comprises raw k-space data in several sub-dataset groups. Curation of these data are part of an IRB approved study. Raw and DICOM data have been deidentified via conversion to the vendor-neutral ISMRMD format and the RSNA clinical trial processor, respectively. Also, each DICOM image is manually inspected for the presence of any unexpected protected health information (PHI), with spot checking of both metadata and image content. Knee MRI: Data from more than 1,500 fully sampled knee MRIs obtained on 3 and 1.5 Tesla magnets and DICOM images from 10,000 clinical knee MRIs also obtained at 3 or 1.5 Tesla. The raw dataset includes coronal proton density-weighted images with and without fat suppression. The DICOM dataset contains coronal proton density-weighted with and without fat suppr
177 PAPERS • 4 BENCHMARKS
The LIDC-IDRI dataset contains lesion annotations from four experienced thoracic radiologists. LIDC-IDRI contains 1,018 low-dose lung CTs from 1010 lung patients.
148 PAPERS • 3 BENCHMARKS
ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 to 2015) unique patients with the text-mined fourteen common disease labels, mined from the text radiological reports via NLP techniques. It expands on ChestX-ray8 by adding six additional thorax diseases: Edema, Emphysema, Fibrosis, Pleural Thickening and Hernia.
144 PAPERS • 4 BENCHMARKS
CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.
130 PAPERS • 2 BENCHMARKS
BraTS 2018 is a dataset which provides multimodal 3D brain MRIs and ground truth brain tumor segmentations annotated by physicians, consisting of 4 MRI modalities per case (T1, T1c, T2, and FLAIR). Annotations include 3 tumor subregions—the enhancing tumor, the peritumoral edema, and the necrotic and non-enhancing tumor core. The annotations were combined into 3 nested subregions—whole tumor (WT), tumor core (TC), and enhancing tumor (ET). The data were collected from 19 institutions, using various MRI scanners
107 PAPERS • 2 BENCHMARKS
The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.
100 PAPERS • 6 BENCHMARKS
The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) color fundus images. For each image, two groups of annotations are provided..
96 PAPERS • 1 BENCHMARK
HAM10000 is a dataset of 10000 training images for detecting pigmented skin lesions. The authors collected dermatoscopic images from different populations, acquired and stored by different modalities.
86 PAPERS • 2 BENCHMARKS
The LUNA challenges provide datasets for automatic nodule detection algorithms using the largest publicly available reference database of chest CT scans, the LIDC-IDRI data set. In LUNA16, participants develop their algorithm and upload their predictions on 888 CT scans in one of the two tracks: 1) the complete nodule detection track where a complete CAD system should be developed, or 2) the false positive reduction track where a provided set of nodule candidates should be classified.
86 PAPERS • 2 BENCHMARKS
Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then verified by an experienced gastroenterologist.
66 PAPERS • 3 BENCHMARKS
The Medical Segmentation Decathlon is a collection of medical image segmentation datasets. It contains a total of 2,633 three-dimensional images collected across multiple anatomies of interest, multiple modalities and multiple sources. Specifically, it contains data for the following body organs or parts: Brain, Heart, Liver, Hippocampus, Prostate, Lung, Pancreas, Hepatic Vessel, Spleen and Colon.
65 PAPERS • 1 BENCHMARK
The LUNA16 (LUng Nodule Analysis) dataset is a dataset for lung segmentation. It consists of 1,186 lung nodules annotated in 888 CT scans.
63 PAPERS • 1 BENCHMARK
The BRATS2017 dataset. It contains 285 brain tumor MRI scans, with four MRI modalities as T1, T1ce, T2, and Flair for each scan. The dataset also provides full masks for brain tumors, with labels for ED, ET, NET/NCR. The segmentation evaluation is based on three tasks: WT, TC and ET segmentation.
61 PAPERS • 1 BENCHMARK
ChestX-ray8 is a medical imaging dataset which comprises 108,948 frontal-view X-ray images of 32,717 (collected from the year of 1992 to 2015) unique patients with the text-mined eight common disease labels, mined from the text radiological reports via NLP techniques.
59 PAPERS • NO BENCHMARKS YET
MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies were performed at Beth Israel Deaconess Medical Center in Boston, MA.
58 PAPERS • 1 BENCHMARK
The Parkinson’s Progression Markers Initiative (PPMI) dataset originates from an observational clinical and longitudinal study comprising evaluations of people with Parkinson’s disease (PD), those people with high risk, and those who are healthy.
58 PAPERS • 2 BENCHMARKS
Cholec80 is an endoscopic video dataset containing 80 videos of cholecystectomy surgeries performed by 13 surgeons. The videos are captured at 25 fps and downsampled to 1 fps for processing. The whole dataset is labeled with the phase and tool presence annotations. The phases have been defined by a senior surgeon in Strasbourg hospital, France. Since the tools are sometimes hardly visible in the images and thus difficult to be recognized visually, a tool is defined as present in an image if at least half of the tool tip is visible.
56 PAPERS • 1 BENCHMARK
The BraTS 2015 dataset is a dataset for brain tumor image segmentation. It consists of 220 high grade gliomas (HGG) and 54 low grade gliomas (LGG) MRIs. The four MRI modalities are T1, T1c, T2, and T2FLAIR. Segmented “ground truth” is provide about four intra-tumoral classes, viz. edema, enhancing tumor, non-enhancing tumor, and necrosis.
54 PAPERS • 1 BENCHMARK
The sleep-edf database contains 197 whole-night PolySomnoGraphic sleep recordings, containing EEG, EOG, chin EMG, and event markers. Some records also contain respiration and body temperature. Corresponding hypnograms (sleep patterns) were manually scored by well-trained technicians according to the Rechtschaffen and Kales manual, and are also available.
53 PAPERS • 4 BENCHMARKS
PatchCamelyon is an image classification dataset. It consists of 327.680 color images (96 x 96px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating presence of metastatic tissue. PCam provides a new benchmark for machine learning models: bigger than CIFAR10, smaller than ImageNet, trainable on a single GPU.
52 PAPERS • 2 BENCHMARKS
The PROMISE12 dataset was made available for the MICCAI 2012 prostate segmentation challenge. Magnetic Resonance (MR) images (T2-weighted) of 50 patients with various diseases were acquired at different locations with several MRI vendors and scanning protocols.
49 PAPERS • 1 BENCHMARK
The dataset used in this challenge consists of 165 images derived from 16 H&E stained histological sections of stage T3 or T42 colorectal adenocarcinoma. Each section belongs to a different patient, and sections were processed in the laboratory on different occasions. Thus, the dataset exhibits high inter-subject variability in both stain distribution and tissue architecture. The digitization of these histological sections into whole-slide images (WSIs) was accomplished using a Zeiss MIRAX MIDI Slide Scanner with a pixel resolution of 0.465µm.
47 PAPERS • 1 BENCHMARK
PadChest is a labeled large-scale, high resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at Hospital San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The labels generated were then validated in an independent test set achieving a 0.93 Micro-F1 score.
43 PAPERS • NO BENCHMARKS YET
CHASE_DB1 is a dataset for retinal vessel segmentation which contains 28 color retina images with the size of 999×960 pixels which are collected from both left and right eyes of 14 school children. Each image is annotated by two independent human experts.
37 PAPERS • 2 BENCHMARKS
MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines.
33 PAPERS • 1 BENCHMARK
Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19.
31 PAPERS • NO BENCHMARKS YET
BRATS 2013 is a brain tumor segmentation dataset consists of synthetic and real images, where each of them is further divided into high-grade gliomas (HG) and low-grade gliomas (LG). There are 25 patients with both synthetic HG and LG images and 20 patients with real HG and 10 patients with real LG images. For each patient, FLAIR, T1, T2, and post-Gadolinium T1 magnetic resonance (MR) image sequences are available.
30 PAPERS • 2 BENCHMARKS
A large dataset of musculoskeletal radiographs containing 40,561 images from 14,863 studies, where each study is manually labeled by radiologists as either normal or abnormal.
27 PAPERS • NO BENCHMARKS YET
The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. Each subset contains one healthy fundus image, one image of patient with diabetic retinopathy and one glaucoma image. The image sizes are 3,304 x 2,336, with a training/testing image split of 22/23.
26 PAPERS • 1 BENCHMARK
We introduce here a new database called UBFC-rPPG (stands for Univ. Bourgogne Franche-Comté Remote PhotoPlethysmoGraphy) comprising two datasets that are focused specifically on rPPG analysis. The UBFC-RPPG database was created using a custom C++ application for video acquisition with a simple low cost webcam (Logitech C920 HD Pro) at 30fps with a resolution of 640x480 in uncompressed 8-bit RGB format. A CMS50E transmissive pulse oximeter was used to obtain the ground truth PPG data comprising the PPG waveform as well as the PPG heart rates. During the recording, the subject sits in front of the camera (about 1m away from the camera) with his/her face visible. All experiments are conducted indoors with a varying amount of sunlight and indoor illumination. The link to download the complete video dataset is available on request. A basic Matlab implementation can also be provided to read ground truth data acquired with a pulse oximeter.
26 PAPERS • 1 BENCHMARK
The BIOSSES data set comprises total 100 sentence pairs all of which were selected from the "TAC2 Biomedical Summarization Track Training Data Set" .
24 PAPERS • 3 BENCHMARKS
CVC-ClinicDB is an open-access dataset of 612 images with a resolution of 384×288 from 31 colonoscopy sequences.It is used for medical image segmentation, in particular polyp detection in colonoscopy videos.
22 PAPERS • 1 BENCHMARK
LiTS17 is a liver tumor segmentation benchmark. The data and segmentations are provided by various clinical sites around the world. The training data set contains 130 CT scans and the test data set 70 CT scans. Image Source: https://arxiv.org/pdf/1707.07734.pdf
21 PAPERS • 2 BENCHMARKS
The colorectal nuclear segmentation and phenotypes (CoNSeP) dataset consists of 41 H&E stained image tiles, each of size 1,000×1,000 pixels at 40× objective magnification. The images were extracted from 16 colorectal adenocarcinoma (CRA) WSIs, each belonging to an individual patient, and scanned with an Omnyx VL120 scanner within the department of pathology at University Hospitals Coventry and Warwickshire, UK.
20 PAPERS • 1 BENCHMARK
The MMSE-HR benchmark consists of a dataset of 102 videos from 40 subjects recorded at 1040x1392 raw resolution at 25fps. During the recordings, various stimuli such as videos, sounds, and smells are introduced to induce different emotional states in the subjects. The ground truth waveform for MMSE-HR is the blood pressure signal sampled at 1000Hz. The dataset contains a diverse distribution of skin colors in the Fitzpatrick scale (II=8, III=11, IV=17, V+VI=4).
20 PAPERS • 1 BENCHMARK
MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.
16 PAPERS • NO BENCHMARKS YET
PanNuke is a semi automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources. In total the dataset contains 205,343 labeled nuclei, each with an instance segmentation mask.
16 PAPERS • 1 BENCHMARK
CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case reports.
15 PAPERS • 1 BENCHMARK
MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
15 PAPERS • NO BENCHMARKS YET
The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units (ICU). Each record consists of roughly 48 hours of multivariate time series data with up to 37 features recorded at various times from the patients during their stay such as respiratory rate, glucose etc.
15 PAPERS • 4 BENCHMARKS
The National Institutes of Health’s Clinical Center has made a large-scale dataset of CT images publicly available to help the scientific community improve detection accuracy of lesions. While most publicly available medical image datasets have less than a thousand lesions, this dataset, named DeepLesion, has over 32,000 annotated lesions (220GB) identified on CT images. DeepLesion, a dataset with 32,735 lesions in 32,120 CT slices from 10,594 studies of 4,427 unique patients. There are a variety of lesion types in this dataset, such as lung nodules, liver tumors, enlarged lymph nodes, and so on. It has the potential to be used in various medical image applications
13 PAPERS • 1 BENCHMARK
IXI Dataset is a collection of 600 MR brain images from normal, healthy subjects. The MR image acquisition protocol for each subject includes:
12 PAPERS • 3 BENCHMARKS
Under Institutional Review Board (IRB) supervision, 50 abdomen CT scans of were randomly selected from a combination of an ongoing colorectal cancer chemotherapy trial, and a retrospective ventral hernia study. The 50 scans were captured during portal venous contrast phase with variable volume sizes (512 x 512 x 85 - 512 x 512 x 198) and field of views (approx. 280 x 280 x 280 mm3 - 500 x 500 x 650 mm3). The in-plane resolution varies from 0.54 x 0.54 mm2 to 0.98 x 0.98 mm2, while the slice thickness ranges from 2.5 mm to 5.0 mm. The standard registration data was generated by NiftyReg.
12 PAPERS • 2 BENCHMARKS
BRATS 2016 is a brain tumor segmentation dataset. It shares the same training set as BRATS 2015, which consists of 220 HHG and 54 LGG. Its testing dataset consists of 191 cases with unknown grades. Image Source: https://sites.google.com/site/braintumorsegmentation/home/brats_2016
11 PAPERS • NO BENCHMARKS YET
The MIT-BIH Arrhythmia Database contains 48 half-hour excerpts of two-channel ambulatory ECG recordings, obtained from 47 subjects studied by the BIH Arrhythmia Laboratory between 1975 and 1979. Twenty-three recordings were chosen at random from a set of 4000 24-hour ambulatory ECG recordings collected from a mixed population of inpatients (about 60%) and outpatients (about 40%) at Boston's Beth Israel Hospital; the remaining 25 recordings were selected from the same set to include less common but clinically significant arrhythmias that would not be well-represented in a small random sample.
11 PAPERS • 1 BENCHMARK