12,000 Resting 12-Lead ECG Recordings with Expert Atrial Fibrillation Annotations
Status: Open
Overview
We are seeking a large, well-annotated dataset of resting 12-lead ECG recordings from adult patients to train and validate a deep-learning classifier for atrial fibrillation (AF) detection. The intended model architecture is a convolutional-recurrent network that operates directly on raw voltage traces, and its performance is highly sensitive to dataset size, annotation quality, and demographic diversity.
Signal requirements: each recording must capture all twelve standard leads (I, II, III, aVR, aVL, aVF, V1–V6) at a minimum sampling rate of 500 Hz, with an amplitude resolution of 1 μV or finer (12-bit ADC or better). Recording duration must be ≥10 seconds per strip; longer 30-second captures are strongly preferred. Accepted file formats are WFDB (PhysioNet/MIT-BIH style header + signal files) or EDF; CSV exports with a standardised column schema are acceptable as a secondary option.
Labels: each record must be accompanied by a cardiologist-confirmed rhythm label (at minimum a binary AF / non-AF tag), with additional labels for atrial flutter, supraventricular tachycardia, normal sinus rhythm, and sinus bradycardia strongly preferred. Fiducial-point (keypoint) annotations marking P-wave onset and offset, QRS complex onset, peak, and offset, and T-wave end are highly desirable for training auxiliary tasks.
Labelling protocol: annotation must follow a two-stage review. A primary annotation is produced by a board-certified cardiologist or credentialed cardiac physiologist, followed by an independent over-read by a second annotator; disagreements must be adjudicated by a senior electrophysiologist. Inter-rater agreement (Cohen's kappa) should be reported per rhythm class and included in the dataset documentation. All annotations must use a standardised label taxonomy aligned with the AHA/ACC ECG terminology guidelines to ensure compatibility with publicly available benchmarks such as the PhysioNet Challenge datasets and the MIMIC-IV-ECG corpus.
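The per-class inter-rater agreement requested above could be computed along the following lines. This is a minimal sketch, not a prescribed method: the function names `cohens_kappa` and `per_class_kappa` are hypothetical, each rhythm class is scored one-vs-rest (one possible reading of "per rhythm class"), and the labels are assumed to be the primary annotation and the independent over-read for the same records, in the same order.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same records."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of records with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is an assumption of this sketch.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def per_class_kappa(labels_a, labels_b):
    """One-vs-rest kappa for every rhythm class used by either annotator."""
    classes = sorted(set(labels_a) | set(labels_b))
    return {
        c: cohens_kappa([x == c for x in labels_a], [y == c for y in labels_b])
        for c in classes
    }
```

Calling `per_class_kappa(primary_labels, overread_labels)` returns a dictionary keyed by rhythm label, which can be copied into the dataset documentation alongside the number of records per class.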
De-identification must satisfy HIPAA Safe Harbor or an equivalent EU GDPR pseudonymisation standard: no patient name, date of birth, facility name, or accession number may appear in signal file headers or companion metadata files. Any free-text physician notes attached to a recording must be scrubbed using a validated PHI-detection NLP pipeline before delivery. Age bucket (decade), biological sex, and comorbidity flags (hypertension, heart failure, diabetes) should be retained as structured metadata fields.
Quality exclusion criteria: recordings with any of the following must be flagged or removed:
- electrode reversal artefact detectable from lead polarity inversion
- lead-off noise affecting more than two contiguous leads
- baseline wander exceeding 0.5 mV peak-to-peak
- signal clipping
We require an AF-to-non-AF ratio of no worse than 1:3 (that is, at least one AF recording for every three non-AF recordings), and we encourage inclusion of paroxysmal AF cases captured during or immediately after an episode, as these are clinically the hardest to classify and the most valuable for model generalisation. Demographic balance across age groups (18–40, 41–60, 61–80, >80 years) and sex is mandatory.
Downstream use cases include a real-time AF alert integrated into hospital ECG cart software, a cloud-based clinical decision support API, and federated training experiments across multiple institution nodes.
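Two of the quality exclusion criteria above, baseline wander and signal clipping, lend themselves to automated screening. The sketch below is illustrative only and rests on assumptions that are not part of this request: the baseline is estimated with a 1-second centred moving average, clipping is defined as five or more consecutive samples at a lead's extreme value, and all function names are hypothetical. Samples are assumed to be in millivolts.

```python
BASELINE_LIMIT_MV = 0.5  # peak-to-peak limit stated in the exclusion criteria


def moving_average(samples, window):
    """Centred moving average via prefix sums; edges use a shorter window."""
    prefix = [0.0]
    for value in samples:
        prefix.append(prefix[-1] + value)
    half = window // 2
    n = len(samples)
    averaged = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        averaged.append((prefix[hi] - prefix[lo]) / (hi - lo))
    return averaged


def baseline_wander_mv(lead_mv, fs_hz, window_s=1.0):
    """Peak-to-peak excursion of the estimated baseline, in mV."""
    baseline = moving_average(lead_mv, max(1, int(window_s * fs_hz)))
    return max(baseline) - min(baseline)


def is_clipped(lead_mv, min_run=5):
    """True if the lead stays at its maximum or minimum for min_run samples."""
    hi, lo = max(lead_mv), min(lead_mv)
    run, previous = 0, None
    for value in lead_mv:
        if value == hi or value == lo:
            # Extend the run only if we are still on the same rail.
            run = run + 1 if value == previous else 1
        else:
            run = 0
        previous = value
        if run >= min_run:
            return True
    return False


def quality_flags(record_mv, fs_hz):
    """Map lead name -> list of raised flags; leads with no flags are omitted."""
    flags = {}
    for lead, samples in record_mv.items():
        lead_flags = []
        if baseline_wander_mv(samples, fs_hz) > BASELINE_LIMIT_MV:
            lead_flags.append("baseline_wander")
        if is_clipped(samples):
            lead_flags.append("clipping")
        if lead_flags:
            flags[lead] = lead_flags
    return flags
```

A moving average is a crude baseline estimator and a completely flat lead is also reported as clipped, so flagged recordings are best treated as candidates for manual review. Electrode reversal and lead-off detection are not covered here.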
Data Specifications
| Field | Value |
|---|---|
| Category | Sensor / device data |
| Required quantity | 12,000 recordings |
| Data types | Sensor / device data, ECG, Cardiac, EDF, WFDB |
| Budget | USD 54,000.00 |
| Deadline | 2026-09-30 |
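As an illustration of how a delivery might be checked against these specifications and the signal requirements in the Overview, the sketch below validates a manifest of records. The `RecordInfo` structure and its field names are hypothetical, since this request does not define a manifest schema; the thresholds (12 leads, 500 Hz, 10 seconds, 12,000 records, AF-to-non-AF ratio of 1:3) are taken from the text above.

```python
from dataclasses import dataclass

REQUIRED_LEADS = {"I", "II", "III", "aVR", "aVL", "aVF",
                  "V1", "V2", "V3", "V4", "V5", "V6"}
MIN_FS_HZ = 500
MIN_DURATION_S = 10


@dataclass
class RecordInfo:
    """Hypothetical per-record manifest entry (not defined by the request)."""
    record_id: str
    leads: tuple
    fs_hz: float
    duration_s: float
    is_af: bool


def record_problems(rec):
    """List the signal requirements that a single record fails to meet."""
    problems = []
    missing = REQUIRED_LEADS - set(rec.leads)
    if missing:
        problems.append("missing leads: " + ", ".join(sorted(missing)))
    if rec.fs_hz < MIN_FS_HZ:
        problems.append("sampling rate below 500 Hz")
    if rec.duration_s < MIN_DURATION_S:
        problems.append("duration below 10 s")
    return problems


def delivery_summary(records, required_quantity=12000):
    """Count compliant records and check quantity and class-ratio targets."""
    valid = [r for r in records if not record_problems(r)]
    n_af = sum(r.is_af for r in valid)
    n_non_af = len(valid) - n_af
    return {
        "valid_records": len(valid),
        "quantity_met": len(valid) >= required_quantity,
        # AF:non-AF no worse than 1:3 means n_af / n_non_af >= 1/3.
        "af_ratio_met": n_af > 0 and 3 * n_af >= n_non_af,
    }
```

Only records that pass every signal requirement count toward the quantity and ratio targets in this sketch, which is a design choice rather than something the request specifies.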
Use Cases
- Training and validating AI/ML models on sensor / device data
- Benchmarking detection and segmentation algorithms for sensor / device data
- Building de-identified sensor / device research datasets for academic studies
- Augmenting existing sensor / device datasets to reduce class imbalance