The Medical Information Mart for Intensive Care III (MIMIC-III) dataset is a large, de-identified and publicly-available collection of medical records. Each record in the dataset includes ICD-9 codes, which identify diagnoses and procedures performed. Each code is partitioned into sub-codes, which often include specific circumstantial details. The dataset consists of 112,000 clinical reports records (average length 709.3 tokens) and 1,159 top-level ICD-9 codes. Each report is assigned to 7.6 codes, on average.
471 PAPERS • 7 BENCHMARKS
The CheXpert dataset contains 224,316 chest radiographs of 65,240 patients with both frontal and lateral views available. The task is to do automated chest x-ray interpretation, featuring uncertainty labels and radiologist-labeled reference standard evaluation sets.
200 PAPERS • 1 BENCHMARK
The Digital Retinal Images for Vessel Extraction (DRIVE) dataset is a dataset for retinal vessel segmentation. It consists of a total of JPEG 40 color fundus images; including 7 abnormal pathology cases. The images were obtained from a diabetic retinopathy screening program in the Netherlands. The images were acquired using Canon CR5 non-mydriatic 3CCD camera with FOV equals to 45 degrees. Each image resolution is 584*565 pixels with eight bits per color channel (3 channels).
164 PAPERS • 2 BENCHMARKS
ChestX-ray14 is a medical imaging dataset which comprises 112,120 frontal-view X-ray images of 30,805 (collected from the year of 1992 to 2015) unique patients with the text-mined fourteen common disease labels, mined from the text radiological reports via NLP techniques. It expands on ChestX-ray8 by adding six additional thorax diseases: Edema, Emphysema, Fibrosis, Pleural Thickening and Hernia.
118 PAPERS • 3 BENCHMARKS
The LIDC-IDRI dataset contains lesion annotations from four experienced thoracic radiologists. LIDC-IDRI contains 1,018 low-dose lung CTs from 1010 lung patients.
118 PAPERS • 3 BENCHMARKS
The fastMRI dataset includes two types of MRI scans: knee MRIs and the brain (neuro) MRIs, and containing training, validation, and masked test sets. The deidentified imaging dataset provided by NYU Langone comprises raw k-space data in several sub-dataset groups. Curation of these data are part of an IRB approved study. Raw and DICOM data have been deidentified via conversion to the vendor-neutral ISMRMD format and the RSNA clinical trial processor, respectively. Also, each DICOM image is manually inspected for the presence of any unexpected protected health information (PHI), with spot checking of both metadata and image content. Knee MRI: Data from more than 1,500 fully sampled knee MRIs obtained on 3 and 1.5 Tesla magnets and DICOM images from 10,000 clinical knee MRIs also obtained at 3 or 1.5 Tesla. The raw dataset includes coronal proton density-weighted images with and without fat suppression. The DICOM dataset contains coronal proton density-weighted with and without fat suppr
115 PAPERS • 4 BENCHMARKS
CORD-19 is a free resource of tens of thousands of scholarly articles about COVID-19, SARS-CoV-2, and related coronaviruses for use by the global research community.
104 PAPERS • NO BENCHMARKS YET
BraTS 2018 is a dataset which provides multimodal 3D brain MRIs and ground truth brain tumor segmentations annotated by physicians, consisting of 4 MRI modalities per case (T1, T1c, T2, and FLAIR). Annotations include 3 tumor subregions—the enhancing tumor, the peritumoral edema, and the necrotic and non-enhancing tumor core. The annotations were combined into 3 nested subregions—whole tumor (WT), tumor core (TC), and enhancing tumor (ET). The data were collected from 19 institutions, using various MRI scanners
90 PAPERS • 2 BENCHMARKS
The STARE (Structured Analysis of the Retina) dataset is a dataset for retinal vessel segmentation. It contains 20 equal-sized (700×605) color fundus images. For each image, two groups of annotations are provided..
81 PAPERS • 1 BENCHMARK
The GENIA corpus is the primary collection of biomedical literature compiled and annotated within the scope of the GENIA project. The corpus was created to support the development and evaluation of information extraction and text mining systems for the domain of molecular biology.
74 PAPERS • 6 BENCHMARKS
The LUNA challenges provide datasets for automatic nodule detection algorithms using the largest publicly available reference database of chest CT scans, the LIDC-IDRI data set. In LUNA16, participants develop their algorithm and upload their predictions on 888 CT scans in one of the two tracks: 1) the complete nodule detection track where a complete CAD system should be developed, or 2) the false positive reduction track where a provided set of nodule candidates should be classified.
73 PAPERS • 2 BENCHMARKS
HAM10000 is a dataset of 10000 training images for detecting pigmented skin lesions. The authors collected dermatoscopic images from different populations, acquired and stored by different modalities.
58 PAPERS • 1 BENCHMARK
The BRATS2017 dataset. It contains 285 brain tumor MRI scans, with four MRI modalities as T1, T1ce, T2, and Flair for each scan. The dataset also provides full masks for brain tumors, with labels for ED, ET, NET/NCR. The segmentation evaluation is based on three tasks: WT, TC and ET segmentation.
55 PAPERS • 1 BENCHMARK
The LUNA16 (LUng Nodule Analysis) dataset is a dataset for lung segmentation. It consists of 1,186 lung nodules annotated in 888 CT scans.
54 PAPERS • 1 BENCHMARK
ChestX-ray8 is a medical imaging dataset which comprises 108,948 frontal-view X-ray images of 32,717 (collected from the year of 1992 to 2015) unique patients with the text-mined eight common disease labels, mined from the text radiological reports via NLP techniques.
53 PAPERS • NO BENCHMARKS YET
The BraTS 2015 dataset is a dataset for brain tumor image segmentation. It consists of 220 high grade gliomas (HGG) and 54 low grade gliomas (LGG) MRIs. The four MRI modalities are T1, T1c, T2, and T2FLAIR. Segmented “ground truth” is provide about four intra-tumoral classes, viz. edema, enhancing tumor, non-enhancing tumor, and necrosis.
51 PAPERS • 1 BENCHMARK
The Medical Segmentation Decathlon is a collection of medical image segmentation datasets. It contains a total of 2,633 three-dimensional images collected across multiple anatomies of interest, multiple modalities and multiple sources. Specifically, it contains data for the following body organs or parts: Brain, Heart, Liver, Hippocampus, Prostate, Lung, Pancreas, Hepatic Vessel, Spleen and Colon.
49 PAPERS • 1 BENCHMARK
PatchCamelyon is an image classification dataset. It consists of 327.680 color images (96 x 96px) extracted from histopathologic scans of lymph node sections. Each image is annotated with a binary label indicating presence of metastatic tissue. PCam provides a new benchmark for machine learning models: bigger than CIFAR10, smaller than ImageNet, trainable on a single GPU.
43 PAPERS • 2 BENCHMARKS
The Parkinson’s Progression Markers Initiative (PPMI) dataset originates from an observational clinical and longitudinal study comprising evaluations of people with Parkinson’s disease (PD), those people with high risk, and those who are healthy.
43 PAPERS • 2 BENCHMARKS
The PROMISE12 dataset was made available for the MICCAI 2012 prostate segmentation challenge. Magnetic Resonance (MR) images (T2-weighted) of 50 patients with various diseases were acquired at different locations with several MRI vendors and scanning protocols.
42 PAPERS • 1 BENCHMARK
Cholec80 is an endoscopic video dataset containing 80 videos of cholecystectomy surgeries performed by 13 surgeons. The videos are captured at 25 fps and downsampled to 1 fps for processing. The whole dataset is labeled with the phase and tool presence annotations. The phases have been defined by a senior surgeon in Strasbourg hospital, France. Since the tools are sometimes hardly visible in the images and thus difficult to be recognized visually, a tool is defined as present in an image if at least half of the tool tip is visible.
35 PAPERS • 1 BENCHMARK
Kvasir-SEG is an open-access dataset of gastrointestinal polyp images and corresponding segmentation masks, manually annotated by a medical doctor and then verified by an experienced gastroenterologist.
35 PAPERS • 3 BENCHMARKS
The dataset used in this challenge consists of 165 images derived from 16 H&E stained histological sections of stage T3 or T42 colorectal adenocarcinoma. Each section belongs to a different patient, and sections were processed in the laboratory on different occasions. Thus, the dataset exhibits high inter-subject variability in both stain distribution and tissue architecture. The digitization of these histological sections into whole-slide images (WSIs) was accomplished using a Zeiss MIRAX MIDI Slide Scanner with a pixel resolution of 0.465µm.
33 PAPERS • 1 BENCHMARK
The sleep-edf database contains 197 whole-night PolySomnoGraphic sleep recordings, containing EEG, EOG, chin EMG, and event markers. Some records also contain respiration and body temperature. Corresponding hypnograms (sleep patterns) were manually scored by well-trained technicians according to the Rechtschaffen and Kales manual, and are also available.
33 PAPERS • 4 BENCHMARKS
MIMIC-CXR from Massachusetts Institute of Technology presents 371,920 chest X-rays associated with 227,943 imaging studies from 65,079 patients. The studies were performed at Beth Israel Deaconess Medical Center in Boston, MA.
32 PAPERS • 1 BENCHMARK
BRATS 2013 is a brain tumor segmentation dataset consists of synthetic and real images, where each of them is further divided into high-grade gliomas (HG) and low-grade gliomas (LG). There are 25 patients with both synthetic HG and LG images and 20 patients with real HG and 10 patients with real LG images. For each patient, FLAIR, T1, T2, and post-Gadolinium T1 magnetic resonance (MR) image sequences are available.
30 PAPERS • 2 BENCHMARKS
PadChest is a labeled large-scale, high resolution chest x-ray dataset for the automated exploration of medical images along with their associated reports. This dataset includes more than 160,000 images obtained from 67,000 patients that were interpreted and reported by radiologists at Hospital San Juan Hospital (Spain) from 2009 to 2017, covering six different position views and additional information on image acquisition and patient demography. The reports were labeled with 174 different radiographic findings, 19 differential diagnoses and 104 anatomic locations organized as a hierarchical taxonomy and mapped onto standard Unified Medical Language System (UMLS) terminology. Of these reports, 27% were manually annotated by trained physicians and the remaining set was labeled using a supervised method based on a recurrent neural network with attention mechanisms. The labels generated were then validated in an independent test set achieving a 0.93 Micro-F1 score.
30 PAPERS • NO BENCHMARKS YET
CHASE_DB1 is a dataset for retinal vessel segmentation which contains 28 color retina images with the size of 999×960 pixels which are collected from both left and right eyes of 14 school children. Each image is annotated by two independent human experts.
28 PAPERS • 2 BENCHMARKS
Contains hundreds of frontal view X-rays and is the largest public resource for COVID-19 image and prognostic data, making it a necessary resource to develop and evaluate tools to aid in the treatment of COVID-19.
28 PAPERS • NO BENCHMARKS YET
MedMentions is a new manually annotated resource for the recognition of biomedical concepts. What distinguishes MedMentions from other annotated biomedical corpora is its size (over 4,000 abstracts and over 350,000 linked mentions), as well as the size of the concept ontology (over 3 million concepts from UMLS 2017) and its broad coverage of biomedical disciplines.
24 PAPERS • 1 BENCHMARK
The HRF dataset is a dataset for retinal vessel segmentation which comprises 45 images and is organized as 15 subsets. Each subset contains one healthy fundus image, one image of patient with diabetic retinopathy and one glaucoma image. The image sizes are 3,304 x 2,336, with a training/testing image split of 22/23.
20 PAPERS • 1 BENCHMARK
A large dataset of musculoskeletal radiographs containing 40,561 images from 14,863 studies, where each study is manually labeled by radiologists as either normal or abnormal.
19 PAPERS • NO BENCHMARKS YET
The BIOSSES data set comprises total 100 sentence pairs all of which were selected from the "TAC2 Biomedical Summarization Track Training Data Set" .
15 PAPERS • 3 BENCHMARKS
CVC-ClinicDB is an open-access dataset of 612 images with a resolution of 384×288 from 31 colonoscopy sequences.It is used for medical image segmentation, in particular polyp detection in colonoscopy videos.
15 PAPERS • 1 BENCHMARK
MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
15 PAPERS • NO BENCHMARKS YET
CliCR is a new dataset for domain specific reading comprehension used to construct around 100,000 cloze queries from clinical case reports.
14 PAPERS • 1 BENCHMARK
LiTS17 is a liver tumor segmentation benchmark. The data and segmentations are provided by various clinical sites around the world. The training data set contains 130 CT scans and the test data set 70 CT scans. Image Source: https://arxiv.org/pdf/1707.07734.pdf
13 PAPERS • 2 BENCHMARKS
The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units (ICU). Each record consists of roughly 48 hours of multivariate time series data with up to 37 features recorded at various times from the patients during their stay such as respiratory rate, glucose etc.
13 PAPERS • 4 BENCHMARKS
BRATS 2016 is a brain tumor segmentation dataset. It shares the same training set as BRATS 2015, which consists of 220 HHG and 54 LGG. Its testing dataset consists of 191 cases with unknown grades. Image Source: https://sites.google.com/site/braintumorsegmentation/home/brats_2016
11 PAPERS • NO BENCHMARKS YET
The colorectal nuclear segmentation and phenotypes (CoNSeP) dataset consists of 41 H&E stained image tiles, each of size 1,000×1,000 pixels at 40× objective magnification. The images were extracted from 16 colorectal adenocarcinoma (CRA) WSIs, each belonging to an individual patient, and scanned with an Omnyx VL120 scanner within the department of pathology at University Hospitals Coventry and Warwickshire, UK.
11 PAPERS • NO BENCHMARKS YET
MeQSum is a dataset for medical question summarization. It contains 1,000 summarized consumer health questions.
11 PAPERS • NO BENCHMARKS YET
The National Institutes of Health’s Clinical Center has made a large-scale dataset of CT images publicly available to help the scientific community improve detection accuracy of lesions. While most publicly available medical image datasets have less than a thousand lesions, this dataset, named DeepLesion, has over 32,000 annotated lesions (220GB) identified on CT images. DeepLesion, a dataset with 32,735 lesions in 32,120 CT slices from 10,594 studies of 4,427 unique patients. There are a variety of lesion types in this dataset, such as lung nodules, liver tumors, enlarged lymph nodes, and so on. It has the potential to be used in various medical image applications
10 PAPERS • 1 BENCHMARK
PanNuke is a semi automatically generated nuclei instance segmentation and classification dataset with exhaustive nuclei labels across 19 different tissue types. The dataset consists of 481 visual fields, of which 312 are randomly sampled from more than 20K whole slide images at different magnifications, from multiple data sources. In total the dataset contains 205,343 labeled nuclei, each with an instance segmentation mask.
9 PAPERS • NO BENCHMARKS YET
Under Institutional Review Board (IRB) supervision, 50 abdomen CT scans of were randomly selected from a combination of an ongoing colorectal cancer chemotherapy trial, and a retrospective ventral hernia study. The 50 scans were captured during portal venous contrast phase with variable volume sizes (512 x 512 x 85 - 512 x 512 x 198) and field of views (approx. 280 x 280 x 280 mm3 - 500 x 500 x 650 mm3). The in-plane resolution varies from 0.54 x 0.54 mm2 to 0.98 x 0.98 mm2, while the slice thickness ranges from 2.5 mm to 5.0 mm. The standard registration data was generated by NiftyReg.
8 PAPERS • 1 BENCHMARK
This dataset has 1,842 images with pixel-level DR-related lesion annotations, and 1,000 images with image-level labels graded by six board-certified ophthalmologists with intra-rater consistency. The proposed dataset will enable extensive studies on DR diagnosis.
7 PAPERS • NO BENCHMARKS YET
The MSK dataset is a dataset for lesion recognition from the Memorial Sloan-Kettering Cancer Center. It is used as part of the ISIC lesion recognition challenges.
7 PAPERS • NO BENCHMARKS YET
MosMedData contains anonymised human lung computed tomography (CT) scans with COVID-19 related findings, as well as without such findings. A small subset of studies has been annotated with binary pixel masks depicting regions of interests (ground-glass opacifications and consolidations). CT scans were obtained between 1st of March, 2020 and 25th of April, 2020, and provided by municipal hospitals in Moscow, Russia.
7 PAPERS • NO BENCHMARKS YET
The Diabetic Foot Ulcers dataset (DFUC2021) is a dataset for analysis of pathology, focusing on infection and ischaemia. The final release of DFUC2021 consists of 15,683 DFU patches, with 5,955 training, 5,734 for testing and 3,994 unlabeled DFU patches. The ground truth labels are four classes, i.e. control, infection, ischaemia and both conditions.
6 PAPERS • NO BENCHMARKS YET