How to Source ECG Datasets for Arrhythmia Detection Models

Why arrhythmia models live or die on their labels

An arrhythmia detector is only as good as the rhythm labels it learns from. Billing codes and automated machine annotations are convenient but noisy. Expert-adjudicated labels, ideally with a cardiologist over-read, are what let a model separate atrial fibrillation from artifact or from frequent ectopy.

What a training-ready ECG cohort looks like

Lead configuration documented: a 12-lead resting study and a single-lead patch are very different inputs — do not mix them without labelling the source.
Sampling rate and units recorded: 250–1000 Hz waveforms in WFDB or EDF, with gain and baseline so signals reconstruct in millivolts.
Balanced rhythm classes: real-world prevalence is heavily skewed toward sinus rhythm; over-sampling rare arrhythmias in the training split keeps a model from simply predicting ‘normal’, while evaluation data should keep the real-world mix.
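As a minimal sketch of two of the checks above, the snippet below converts raw ADC counts to millivolts and over-samples minority rhythm classes. It assumes the WFDB-style convention in which gain is expressed in ADC units per millivolt and baseline is the ADC value at 0 mV; the record layout (a dict with a `rhythm` key) and the function names are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter


def adc_to_millivolts(samples, gain, baseline):
    """Convert raw ADC counts to mV: physical = (digital - baseline) / gain.

    Assumes gain is in ADC units per mV (the WFDB header convention).
    """
    if gain == 0:
        raise ValueError("gain must be non-zero")
    return [(s - baseline) / gain for s in samples]


def oversample_to_balance(records, label_key="rhythm", seed=0):
    """Randomly duplicate minority-class records until every class
    matches the size of the largest class.

    Intended for the training split only; `records` is a hypothetical
    list of dicts, each carrying its rhythm label under `label_key`.
    """
    if not records:
        return []
    rng = random.Random(seed)
    by_class = {}
    for rec in records:
        by_class.setdefault(rec[label_key], []).append(rec)
    target = max(len(group) for group in by_class.values())
    balanced = []
    for group in by_class.values():
        balanced.extend(group)
        # Draw with replacement to fill the gap up to the largest class.
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced


def class_counts(records, label_key="rhythm"):
    """Count records per rhythm class."""
    return Counter(rec[label_key] for rec in records)
```

Random duplication is the simplest balancing strategy; class-weighted losses or augmentation are alternatives, and which works best depends on the model and the cohort.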

De-identification and metadata

Recordings must be stripped of protected health information while preserving the waveform and the acquisition metadata that models rely on, such as lead configuration, sampling rate, gain and baseline. Document the de-identification method so downstream users can audit it.
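One way to approach this is an allow-list: keep only the metadata fields known to describe the acquisition, drop everything else, and attach a note describing the method. The sketch below assumes a hypothetical record layout (a dict with `waveform` and `metadata` keys) and hypothetical field names; real exports will differ, and an allow-list on its own does not guarantee compliance with any particular regulation.

```python
import uuid

# Hypothetical acquisition fields to preserve; adjust to the actual export format.
ACQUISITION_FIELDS = {
    "lead_set",
    "sampling_rate_hz",
    "gain",
    "baseline",
    "units",
    "device_model",
}

# Free-text description of the method, stored with each record for auditing.
DEID_METHOD = "allow-list of acquisition fields; random record ID"


def deidentify(record):
    """Return a copy of `record` with only allow-listed metadata.

    The waveform is passed through unchanged. The new record ID is random
    rather than derived from any patient identifier, so it cannot be
    reversed back to the source.
    """
    kept = {
        key: value
        for key, value in record["metadata"].items()
        if key in ACQUISITION_FIELDS
    }
    return {
        "record_id": uuid.uuid4().hex,
        "waveform": list(record["waveform"]),
        "metadata": kept,
        "deid_method": DEID_METHOD,
    }
```

An allow-list fails closed: a new, unexpected field is dropped by default, whereas a block-list of known identifiers would let it through.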

When public datasets fall short

Public corpora rarely match a specific target population or label taxonomy. On GetDATA you can post a request specifying the lead set, rhythm classes, minimum counts and balance you need, and verified providers contribute matching, de-identified recordings.
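Before posting a request, it can help to write the requirements down as data and check how far an existing cohort falls short. The sketch below is purely illustrative: `CohortRequest` and `unmet_requirements` are hypothetical names, not part of any GetDATA interface, and the record layout is assumed.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class CohortRequest:
    """Hypothetical description of the cohort needed."""
    lead_set: str     # e.g. "12-lead" or "single-lead"
    min_counts: dict  # rhythm class -> minimum number of recordings


def unmet_requirements(request, records):
    """Return {rhythm: shortfall} for each class below its requested minimum.

    Only records with the requested lead set are counted.
    """
    counts = Counter(
        rec["rhythm"] for rec in records if rec["lead_set"] == request.lead_set
    )
    return {
        rhythm: needed - counts[rhythm]
        for rhythm, needed in request.min_counts.items()
        if counts[rhythm] < needed
    }
```

The resulting shortfall per rhythm class is a natural starting point for the minimum counts stated in a request.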
