PhysioNet Challenge 2021 (The PhysioNet/Computing in Cardiology Challenge 2021)

Data Description

The training data contain twelve-lead ECGs. The validation and test data contain twelve-lead, six-lead, four-lead, three-lead, and two-lead ECGs:

  1. Twelve leads: I, II, III, aVR, aVL, aVF, V1, V2, V3, V4, V5, V6
  2. Six leads: I, II, III, aVR, aVL, aVF
  3. Four leads: I, II, III, V2
  4. Three leads: I, II, V2
  5. Two leads: I, II

Each ECG recording has one or more labels that describe cardiac abnormalities (and/or a normal sinus rhythm). We mapped the labels for each recording to SNOMED-CT codes. The lists of scored labels and unscored labels are given with the evaluation code; see the scoring section for details.

Data Sources

The Challenge data include recordings from last year’s Challenge and many new recordings for this year’s Challenge:

  1. CPSC Database and CPSC-Extra Database
  2. INCART Database
  3. PTB and PTB-XL Database
  4. The Georgia 12-lead ECG Challenge (G12EC) Database
  5. Augmented Undisclosed Database
  6. Chapman-Shaoxing and Ningbo Database
  7. The University of Michigan (UMich) Database

The Challenge data include annotated twelve-lead ECG recordings from seven sources in four countries across three continents. These databases include over 100,000 twelve-lead ECG recordings with over 88,000 ECGs shared publicly as training data, 6,630 ECGs retained privately as validation data, and 36,266 ECGs retained privately as test data.

  • The first source is the China Physiological Signal Challenge in 2018 (CPSC 2018), which was held during the 7th International Conference on Biomedical Engineering and Biotechnology in Nanjing, China. This source contains two databases: the data from CPSC 2018 (the CPSC Database) and unused data from CPSC 2018 (the CPSC-Extra Database). Together, these databases contain 13,256 ECGs (10,330 ECGs shared as training data, 1,463 retained as validation data, and 1,463 retained as test data). We shared the training set and an unused dataset from CPSC 2018 as training data, and we split the test set from CPSC 2018 into validation and test sets. Each recording is between 6 and 144 seconds long with a sampling frequency of 500 Hz.

  • The second source is the St Petersburg INCART 12-lead Arrhythmia Database. This source contains 74 annotated ECGs (all shared as training data) extracted from 32 Holter monitor recordings. Each recording is 30 minutes long with a sampling frequency of 257 Hz.

  • The third source is the Physikalisch-Technische Bundesanstalt (PTB) and includes two public datasets: the PTB and the PTB-XL databases. The source contains 22,353 ECGs (all shared as training data). Each recording is between 10 and 120 seconds long with a sampling frequency of either 500 or 1,000 Hz.

  • The fourth source is the Georgia 12-lead ECG Challenge (G12EC) Database, which represents a unique demographic of the Southeastern United States. This source contains 20,672 ECGs (10,344 ECGs shared as training data, 5,167 retained as validation data, and 5,161 retained as test data). Each recording is between 5 and 10 seconds long with a sampling frequency of 500 Hz.

  • The fifth source is an undisclosed American database that is geographically distinct from the Georgia database. This source contains 10,000 ECGs (all retained as test data).

  • The sixth source is the Chapman University, Shaoxing People’s Hospital (Chapman-Shaoxing) and Ningbo First Hospital (Ningbo) database. This source contains 45,152 ECGs (all shared as training data). Each recording is 10 seconds long with a sampling frequency of 500 Hz.

  • The seventh source is the UMich Database from the University of Michigan. This source contains 19,642 ECGs (all retained as test data). Each recording is 10 seconds long with a sampling frequency of either 250 Hz or 500 Hz.
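The per-source counts above can be checked against the overall totals with a few lines of Python. The dictionary below only transcribes the (training, validation, test) counts from the source descriptions; the key names and the function name split_totals are shorthand chosen for this sketch.

```python
# (training, validation, test) counts transcribed from the source descriptions.
SPLITS = {
    "CPSC and CPSC-Extra": (10330, 1463, 1463),
    "INCART": (74, 0, 0),
    "PTB and PTB-XL": (22353, 0, 0),
    "G12EC": (10344, 5167, 5161),
    "Undisclosed": (0, 0, 10000),
    "Chapman-Shaoxing and Ningbo": (45152, 0, 0),
    "UMich": (0, 0, 19642),
}


def split_totals(splits):
    """Sum the training, validation, and test counts over all sources."""
    train = sum(counts[0] for counts in splits.values())
    validation = sum(counts[1] for counts in splits.values())
    test = sum(counts[2] for counts in splits.values())
    return train, validation, test


train, validation, test = split_totals(SPLITS)
print(train, validation, test, train + validation + test)
```

The sums are 88,253 training, 6,630 validation, and 36,266 test recordings, which agrees with the totals stated above.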

As in other real-world datasets, the databases may have different proportions of cardiac abnormalities, but all of the labels in the validation or test data are represented in the training data. Moreover, while this is a curated dataset, some of the data and labels are likely to have errors, and an important part of the Challenge is to work out these issues. In particular, some of the databases have human-overread machine labels with single or multiple human readers, so the quality of the labels varies between databases. You can find more information about the label mappings of the Challenge training data in this table.

The six-lead, four-lead, three-lead, and two-lead validation data are reduced-lead versions of the twelve-lead validation data: the same recordings with the same header data but only with signal data for the relevant leads.
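The reduction described above can be sketched in Python as follows. This is a minimal illustration, not the Challenge's own code: the names LEAD_SETS and select_leads are hypothetical, and the signal is assumed to be a sequence with one row per lead, in the same order as the lead names read from the header.

```python
TWELVE_LEADS = ("I", "II", "III", "aVR", "aVL", "aVF",
                "V1", "V2", "V3", "V4", "V5", "V6")

# Lead combinations listed in the data description, keyed by lead count.
LEAD_SETS = {
    12: TWELVE_LEADS,
    6: ("I", "II", "III", "aVR", "aVL", "aVF"),
    4: ("I", "II", "III", "V2"),
    3: ("I", "II", "V2"),
    2: ("I", "II"),
}


def select_leads(signal, lead_names, num_leads):
    """Return the rows of `signal` for the requested reduced-lead set.

    `signal` holds one row per lead; `lead_names` gives the lead name of
    each row. Rows are returned in the order of the reduced-lead set.
    """
    wanted = LEAD_SETS[num_leads]
    index = {name: i for i, name in enumerate(lead_names)}
    missing = [name for name in wanted if name not in index]
    if missing:
        raise ValueError(f"recording lacks leads: {missing}")
    return [signal[index[name]] for name in wanted]
```

For example, select_leads(signal, TWELVE_LEADS, 3) keeps only the rows for leads I, II, and V2 of a twelve-lead recording.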

We are not planning to release the test data at any point, including after the end of the Challenge. Requests for the test data will not receive a response. We do not release test data to prevent overfitting on the test data and claims or publications of inflated performances. We will entertain requests to run code on the test data after the Challenge on a limited basis based on publication necessity and capacity. (The Challenge is largely staged by volunteers.)

Data Format

All data was formatted in WFDB format. Each ECG recording uses a binary MATLAB v4 file (see page 27) for the ECG signal data and a plain text file in WFDB header format for the recording and patient attributes, including the diagnosis, i.e., the labels for the recording. The binary files can be read using the load function in MATLAB and the scipy.io.loadmat function in Python; see our MATLAB and Python example code for working examples. The first line of the header provides information about the total number of leads and the total number of samples or time points per lead, the following lines describe how each lead was encoded, and the last lines provide information on the demographics and diagnosis of the patient.
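As a rough illustration of the Python route mentioned above, the sketch below reads one signal file with scipy.io.loadmat. It is not the Challenge example code. The function name load_signal is hypothetical, and the variable name "val" and the (leads, samples) layout of the stored matrix are assumptions that should be checked against the actual files.

```python
import numpy as np
from scipy.io import loadmat


def load_signal(mat_path, key="val"):
    """Load one recording's signal matrix as a float array.

    Assumption: the matrix is stored under the variable name `key`
    ("val" is assumed here) with one row per lead.
    """
    contents = loadmat(mat_path)
    if key not in contents:
        # loadmat adds metadata entries such as "__header__"; hide them.
        available = [name for name in contents if not name.startswith("__")]
        raise KeyError(f"{key!r} not found; variables in file: {available}")
    return np.asarray(contents[key], dtype=np.float64)
```

Under these assumptions, load_signal("A0001.mat") would return an array whose shape matches the lead and sample counts given in the first line of the corresponding header file.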

For example, a header file A0001.hea may have the following contents:

A0001 12 500 7500 05-Feb-2020 11:39:16
A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
A0001.mat 16+24 1000/mV 16 0 -21 3745 0 III
A0001.mat 16+24 1000/mV 16 0 -17 3680 0 aVR
A0001.mat 16+24 1000/mV 16 0 24 -2664 0 aVL
A0001.mat 16+24 1000/mV 16 0 -7 -1499 0 aVF
A0001.mat 16+24 1000/mV 16 0 -290 390 0 V1
A0001.mat 16+24 1000/mV 16 0 -204 157 0 V2
A0001.mat 16+24 1000/mV 16 0 -96 -2555 0 V3
A0001.mat 16+24 1000/mV 16 0 -112 49 0 V4
A0001.mat 16+24 1000/mV 16 0 -596 -321 0 V5
A0001.mat 16+24 1000/mV 16 0 -16 -3112 0 V6
#Age: 74
#Sex: Male
#Dx: 426783006
#Rx: Unknown
#Hx: Unknown
#Sx: Unknown

From the first line of the file, we see that the recording name is A0001. The recording has 12 leads, each sampled at 500 Hz, and each lead contains 7500 samples (15 seconds). From the next 12 lines of the file (one for each lead), we see that each signal is stored in the file A0001.mat in 16-bit format with an offset of 24 bytes, the gain (analog-to-digital converter (ADC) units per physical unit) is 1000/mV, the resolution of the ADC used to digitize the signal is 16 bits, and the ADC zero, which here also serves as the baseline value corresponding to 0 physical units, is 0. The last four entries of each of these lines are the first value of the signal (28, etc.), the checksum (-1716, etc.), the block size (0), and the lead name (I, etc.). From the final 6 lines, we see that the patient is a 74-year-old male with a diagnosis (Dx) of 426783006, which is the SNOMED-CT code for sinus rhythm. The medical prescription (Rx), history (Hx), and symptom or surgery (Sx) are unknown. Please visit WFDB header format for more information on the header file and variables.
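The header layout walked through above can be read with plain string handling. The sketch below is a minimal parser written for this description, not the Challenge's own code: parse_header and to_physical are hypothetical names, it assumes the simple gain/units form shown in the example (for instance 1000/mV), and it assumes that several diagnoses, when present, are written as a comma-separated list on the #Dx line.

```python
def parse_header(text):
    """Parse a Challenge-style WFDB header into a plain dictionary."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]

    # Record line: name, number of leads, sampling frequency, samples per lead.
    record = lines[0].split()
    num_leads = int(record[1])
    header = {
        "record": record[0],
        "num_leads": num_leads,
        "fs": float(record[2]),
        "num_samples": int(record[3]),
        "leads": [],
        "attributes": {},
    }

    # One signal line per lead; only the fields used below are extracted.
    for line in lines[1:1 + num_leads]:
        parts = line.split()
        gain, _, units = parts[2].partition("/")
        header["leads"].append({
            "file": parts[0],
            "gain": float(gain),         # ADC units per physical unit
            "units": units,
            "baseline": int(parts[4]),   # value corresponding to 0 physical units
            "name": parts[-1],
        })

    # Remaining lines are comments of the form "#Key: value".
    for line in lines[1 + num_leads:]:
        if line.startswith("#") and ":" in line:
            key, _, value = line[1:].partition(":")
            header["attributes"][key.strip()] = value.strip()

    # Assumption: multiple diagnoses are comma-separated SNOMED-CT codes.
    dx = header["attributes"].get("Dx", "")
    header["dx_codes"] = [code.strip() for code in dx.split(",") if code.strip()]
    return header


def to_physical(digital_value, lead):
    """Convert a stored sample to physical units: (value - baseline) / gain."""
    return (digital_value - lead["baseline"]) / lead["gain"]


# Shortened two-lead header in the same layout as the example above.
EXAMPLE = """A0001 2 500 7500 05-Feb-2020 11:39:16
A0001.mat 16+24 1000/mV 16 0 28 -1716 0 I
A0001.mat 16+24 1000/mV 16 0 7 2029 0 II
#Age: 74
#Sex: Male
#Dx: 426783006
"""

header = parse_header(EXAMPLE)
print(header["fs"], [lead["name"] for lead in header["leads"]], header["dx_codes"])
print(to_physical(500, header["leads"][0]))  # 500 ADC units at a gain of 1000/mV
```

With a gain of 1000/mV and a baseline of 0, a stored value of 500 corresponds to 0.5 mV. Keeping the diagnosis codes as strings avoids any accidental numeric formatting of the SNOMED-CT identifiers.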
