MedQuAD includes 47,457 medical question-answer pairs created from 12 NIH websites (e.g. cancer.gov, niddk.nih.gov, GARD, MedlinePlus Health Topics). The collection covers 37 question types (e.g. Treatment, Diagnosis, Side Effects) associated with diseases, drugs and other medical entities such as tests.
23 PAPERS • NO BENCHMARKS YET
Introduced by Da et al. in DigestPath: a Benchmark Dataset with Challenge Review for the Pathological Detection and Segmentation of Digestive-System
22 PAPERS • 1 BENCHMARK
The Retinal OCTA SEgmentation dataset (ROSE) consists of 229 OCTA images with vessel annotations at either centerline level or pixel level.
22 PAPERS • 4 BENCHMARKS
SegTHOR (Segmentation of THoracic Organs at Risk) is a dataset dedicated to the segmentation of organs at risk (OARs) in the thorax, i.e. the organs surrounding the tumour that must be spared from irradiation during radiotherapy. In this dataset, the OARs are the heart, the trachea, the aorta and the esophagus, which have varying spatial and appearance characteristics. The dataset includes 60 3D CT scans, divided into a training set of 40 patients and a test set of 20 patients, where the OARs have been contoured manually by an experienced radiotherapist.
22 PAPERS • NO BENCHMARKS YET
The ISIC 2018 dataset was published by the International Skin Imaging Collaboration (ISIC) as a large-scale dataset of dermoscopy images. The Task 1 dataset, for the lesion segmentation challenge, includes 2594 images.
21 PAPERS • 1 BENCHMARK
Under Institutional Review Board (IRB) supervision, 50 abdominal CT scans were randomly selected from a combination of an ongoing colorectal cancer chemotherapy trial and a retrospective ventral hernia study. The 50 scans were captured during the portal venous contrast phase with variable volume sizes (512 x 512 x 85 to 512 x 512 x 198) and fields of view (approx. 280 x 280 x 280 mm³ to 500 x 500 x 650 mm³). The in-plane resolution varies from 0.54 x 0.54 mm² to 0.98 x 0.98 mm², while the slice thickness ranges from 2.5 mm to 5.0 mm. The standard registration data was generated by NiftyReg.
21 PAPERS • 3 BENCHMARKS
PMC-VQA is a large-scale medical visual question-answering dataset that contains 227k VQA pairs for 149k images covering various modalities and diseases. The question-answer pairs are generated from PMC-OA.
CholecT50 is a dataset of endoscopic videos of laparoscopic cholecystectomy surgery introduced to enable research on fine-grained action recognition in laparoscopic surgery. It is annotated with triplet information in the form of <instrument, verb, target>. The dataset is a collection of 50 videos consisting of 45 videos from the Cholec80 dataset and 5 videos from an in-house dataset of the same surgical procedure.
20 PAPERS • 7 BENCHMARKS
IXI Dataset is a collection of 600 MR brain images from normal, healthy subjects. The MR image acquisition protocol for each subject includes:
20 PAPERS • 4 BENCHMARKS
The MS-CXR dataset provides 1162 image–sentence pairs of bounding boxes and corresponding phrases, collected across eight different cardiopulmonary radiological findings, with an approximately equal number of pairs for each finding. This dataset complements the existing MIMIC-CXR v.2 dataset and comprises: 1. Reviewed and edited bounding boxes and phrases (1026 pairs of bounding box/sentence); and 2. Manual bounding box labels from scratch (136 pairs of bounding box/sentence).
20 PAPERS • NO BENCHMARKS YET
WORD is a dataset for organ semantic segmentation that contains 150 abdominal CT volumes (30,495 slices). Each volume has 16 organs with fine pixel-level annotations and scribble-based sparse annotations. It may be the largest dataset with whole abdominal organ annotations.
MosMedData contains anonymised human lung computed tomography (CT) scans with COVID-19 related findings, as well as without such findings. A small subset of studies has been annotated with binary pixel masks depicting regions of interest (ground-glass opacifications and consolidations). CT scans were obtained between 1st of March, 2020 and 25th of April, 2020, and provided by municipal hospitals in Moscow, Russia.
19 PAPERS • 1 BENCHMARK
The PhysioNet Challenge 2012 dataset is publicly available and contains the de-identified records of 8000 patients in Intensive Care Units (ICU). Each record consists of roughly 48 hours of multivariate time series data with up to 37 features, such as respiratory rate and glucose, recorded at various times during the patient's stay.
19 PAPERS • 5 BENCHMARKS
CliCR is a dataset for domain-specific reading comprehension, consisting of around 100,000 cloze queries constructed from clinical case reports.
18 PAPERS • 1 BENCHMARK
The LC25000 dataset contains 25,000 color images in 5 classes of 5,000 images each. All images are 768 x 768 pixels in size and are in JPEG file format. The 5 classes are: colon adenocarcinomas, benign colonic tissues, lung adenocarcinomas, lung squamous cell carcinomas and benign lung tissues.
17 PAPERS • NO BENCHMARKS YET
CrossMoDA is a large and multi-class benchmark for unsupervised cross-modality Domain Adaptation. The goal of the challenge is to segment two key brain structures involved in the follow-up and treatment planning of vestibular schwannoma (VS): the VS and the cochleas. Currently, the diagnosis and surveillance in patients with VS are commonly performed using contrast-enhanced T1 (ceT1) MR imaging.
16 PAPERS • NO BENCHMARKS YET
The National Institutes of Health’s Clinical Center has made a large-scale dataset of CT images publicly available to help the scientific community improve detection accuracy of lesions. While most publicly available medical image datasets have fewer than a thousand lesions, this dataset, named DeepLesion, has over 32,000 annotated lesions (220GB) identified on CT images: 32,735 lesions in 32,120 CT slices from 10,594 studies of 4,427 unique patients. The dataset covers a variety of lesion types, such as lung nodules, liver tumors and enlarged lymph nodes. It has the potential to be used in various medical image applications.
16 PAPERS • 1 BENCHMARK
MedICaT is a dataset of medical images, captions, subfigure-subcaption annotations, and inline textual references. Figures and captions are extracted from open access articles in PubMed Central and corresponding reference text is derived from S2ORC. The dataset consists of 217,060 figures from 131,410 open access papers; 7507 subcaption and subfigure annotations for 2069 compound figures; and inline references for ~25K figures in the ROCO dataset.
The ECGs in this collection were obtained using a non-commercial, PTB prototype recorder with the following specifications:
16 PAPERS • 4 BENCHMARKS
The BReAst Carcinoma Subtyping (BRACS) dataset is a large cohort of annotated Hematoxylin & Eosin (H&E)-stained images to facilitate the characterization of breast lesions. BRACS contains 547 Whole-Slide Images (WSIs) and 4539 Regions of Interest (ROIs) extracted from the WSIs. Each WSI and its respective ROIs are annotated by the consensus of three board-certified pathologists into different lesion categories. Specifically, BRACS includes three lesion types, i.e., benign, malignant and atypical, which are further subtyped into seven categories.
15 PAPERS • NO BENCHMARKS YET
The Indian Diabetic Retinopathy Image Dataset (IDRiD) consists of typical diabetic retinopathy lesions and normal retinal structures annotated at a pixel level. This dataset also provides information on the disease severity of diabetic retinopathy and diabetic macular edema for each image. It is well suited to the development and evaluation of image analysis algorithms for early detection of diabetic retinopathy.
14 PAPERS • 3 BENCHMARKS
SICAPv2 is a database containing prostate histology whole slide images annotated with both global Gleason scores and patch-level Gleason grades.
14 PAPERS • NO BENCHMARKS YET
BRATS 2016 is a brain tumor segmentation dataset. It shares the same training set as BRATS 2015, which consists of 220 high-grade glioma (HGG) and 54 low-grade glioma (LGG) cases. Its testing dataset consists of 191 cases with unknown grades. Image Source: https://sites.google.com/site/braintumorsegmentation/home/brats_2016
13 PAPERS • NO BENCHMARKS YET
The ISIC 2017 dataset was published by the International Skin Imaging Collaboration (ISIC) as a large-scale dataset of dermoscopy images. The Task 1 challenge dataset for lesion segmentation contains 2,000 images for training with ground truth segmentations (2,000 binary mask images).
The dataset consists of annotated frames containing GI procedure tools such as snares, balloons and biopsy forceps. Besides the images, the dataset includes ground truth masks and bounding boxes, and it has been verified by two expert GI endoscopists.
13 PAPERS • 3 BENCHMARKS
The REFUGE Challenge provides a dataset of 1200 fundus images with ground truth segmentations and clinical glaucoma labels, currently the largest existing one.
13 PAPERS • 5 BENCHMARKS
BrixIA Covid-19 is a large dataset of CXR images comprising all images taken for both triage and patient monitoring in sub-intensive and intensive care units during one month of the pandemic peak (between March 4th and April 4th 2020) at the ASST Spedali Civili di Brescia, and contains all the variability originating from a real clinical scenario. It includes 4,707 CXR images of COVID-19 subjects, acquired with both CR and DX modalities, in AP or PA projection, and retrieved from the facility RIS-PACS system.
12 PAPERS • NO BENCHMARKS YET
The Chaoyang dataset contains 1111 normal, 842 serrated, 1404 adenocarcinoma and 664 adenoma samples for training, and 705 normal, 321 serrated, 840 adenocarcinoma and 273 adenoma samples for testing. This noisy dataset was constructed in a real-world scenario.
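The per-class counts quoted for the Chaoyang dataset can be tabulated to check split sizes and class balance. The following is a minimal sketch that only restates the numbers from the description; the dictionary layout and the helper names `split_totals` and `class_fractions` are illustrative assumptions, not part of any official loader.

```python
# Per-class sample counts copied from the dataset description.
# The dictionary layout is an assumption made for illustration only.
chaoyang_counts = {
    "train": {"normal": 1111, "serrated": 842, "adenocarcinoma": 1404, "adenoma": 664},
    "test": {"normal": 705, "serrated": 321, "adenocarcinoma": 840, "adenoma": 273},
}


def split_totals(counts):
    """Return the total number of samples in each split."""
    return {split: sum(per_class.values()) for split, per_class in counts.items()}


def class_fractions(per_class):
    """Return each class's share of one split as a fraction between 0 and 1."""
    total = sum(per_class.values())
    return {name: n / total for name, n in per_class.items()}


if __name__ == "__main__":
    for split, total in split_totals(chaoyang_counts).items():
        print(f"{split}: {total} samples")
        for name, frac in class_fractions(chaoyang_counts[split]).items():
            print(f"  {name}: {frac:.1%}")
```

Class fractions of this kind are a common starting point for choosing class weights or a stratified sampling scheme when the classes are imbalanced.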
12 PAPERS • 2 BENCHMARKS
The Endomapper dataset is the first collection of complete endoscopy sequences acquired during regular medical practice, including slow and careful screening explorations, making secondary use of medical data. Its original purpose is to facilitate the development and evaluation of VSLAM (Visual Simultaneous Localization and Mapping) methods in real endoscopy data. The first release of the dataset is composed of 50 sequences with a total of more than 13 hours of video. It is also the first endoscopic dataset that includes both the computed geometric and photometric endoscope calibration as well as the original calibration videos. Metadata and annotations associated with the dataset include anatomical landmark and procedure description labeling, tool segmentation masks, COLMAP 3D reconstructions, simulated sequences with ground truth, and metadata related to special cases, such as sequences from the same patient. This information will improve the research in endoscopic VSLAM, a
This dataset has 1,842 images with pixel-level DR-related lesion annotations, and 1,000 images with image-level labels graded by six board-certified ophthalmologists with intra-rater consistency. The proposed dataset will enable extensive studies on DR diagnosis.
Retrospectively collected medical data has the opportunity to improve patient care through knowledge discovery and algorithm development. Broad reuse of medical data is desirable for the greatest public good, but data sharing must be done in a manner which protects patient privacy.
The MSK dataset is a dataset for lesion recognition from the Memorial Sloan-Kettering Cancer Center. It is used as part of the ISIC lesion recognition challenges.
The SUN-SEG dataset is a high-quality, per-frame annotated video polyp segmentation (VPS) dataset, which includes 158,690 frames from the SUN dataset. It extends the labels with diverse types, i.e., object mask, boundary, scribble, polygon, and visual attribute. It also carries over the pathological information from the original SUN dataset, including pathological classification labels, location information, and shape information.
12 PAPERS • 1 BENCHMARK
CBIS-DDSM (Curated Breast Imaging Subset of DDSM) is an updated and standardized version of the Digital Database for Screening Mammography (DDSM). The DDSM is a database of 2,620 scanned film mammography studies. It contains normal, benign, and malignant cases with verified pathology information. The scale of the database along with ground truth validation makes the DDSM a useful tool in the development and testing of decision support systems. The CBIS-DDSM collection includes a subset of the DDSM data selected and curated by a trained mammographer. The images have been decompressed and converted to DICOM format. Updated ROI segmentation and bounding boxes, and pathologic diagnosis for training data are also included. A manuscript describing how to use this dataset in detail is available at https://www.nature.com/articles/sdata2017177.
11 PAPERS • 2 BENCHMARKS
The Respiratory Sound database was originally compiled to support the scientific challenge organized at Int. Conf. on Biomedical Health Informatics - ICBHI 2017.
11 PAPERS • 1 BENCHMARK
The ACNE04 dataset includes 3756 Chinese face images with acne, annotated with local lesion numbers and global acne severity based on the Hayashi criterion.
10 PAPERS • 1 BENCHMARK
ADAM is organized as a half-day Challenge, a Satellite Event of the ISBI 2020 conference in Iowa City, Iowa, USA.
10 PAPERS • 2 BENCHMARKS
The evaluation of human epidermal growth factor receptor 2 (HER2) expression is essential to formulate a precise treatment for breast cancer. The routine evaluation of HER2 is conducted with immunohistochemical techniques (IHC), which is very expensive. Therefore, we propose a breast cancer immunohistochemical (BCI) benchmark attempting to synthesize IHC data directly with the paired hematoxylin and eosin (HE) stained images. The dataset contains 4870 registered image pairs, covering a variety of HER2 expression levels (0, 1+, 2+, 3+).
BCN_20000 is a dataset composed of 19,424 dermoscopic images of skin lesions captured from 2010 to 2016 in the facilities of the Hospital Clínic in Barcelona. The dataset can be used for lesion recognition tasks such as lesion segmentation, lesion detection and lesion classification.
10 PAPERS • NO BENCHMARKS YET
The HyperKvasir dataset contains 110,079 images and 374 videos capturing anatomical landmarks as well as pathological and normal findings, for a total of around 1 million images and video frames.
Diabetic retinopathy is the leading cause of blindness in the working-age population of the developed world. It is estimated to affect over 93 million people.
The MM-WHS 2017 dataset is a dataset for multi-modality whole heart segmentation. It provides 20 labeled and 40 unlabeled CT volumes, as well as 20 labeled and 40 unlabeled MR volumes. In total there are 120 multi-modality cardiac images acquired in a real clinical environment.
The Sixth Informatics for Integrating Biology and the Bedside (i2b2) Natural Language Processing Challenge for Clinical Records focused on the temporal relations in clinical narratives. The organizers provided the research community with a corpus of discharge summaries annotated with temporal information, to be used for the development and evaluation of temporal reasoning systems. 18 teams from around the world participated in the challenge. During the workshop, participating teams presented comprehensive reviews and analysis of their systems, and outlined future research directions suggested by the challenge contributions.
9 PAPERS • 2 BENCHMARKS
We release expert-made scribble annotations for the medical ACDC dataset. The released data must be considered as extending the original ACDC dataset. The ACDC dataset contains cardiac MRI images, paired with hand-made segmentation masks. It is possible to use the segmentation masks provided in the ACDC dataset to evaluate the performance of methods trained using only scribble supervision.
9 PAPERS • 1 BENCHMARK
LoDoPaB-CT is a dataset of computed tomography images and simulated low-dose measurements. It contains over 40,000 scan slices from around 800 patients selected from the LIDC/IDRI Database.
The MedVidQA dataset contains a collection of 3,010 manually created health-related questions and timestamps as visual answers to those questions from trusted video sources, such as accredited medical schools with an established reputation, health institutes, health education, and medical practitioners.
9 PAPERS • NO BENCHMARKS YET