ECG Datasets — Request Electrocardiogram Data

Electrocardiogram (ECG, also EKG) datasets record the heart's electrical activity as voltage-versus-time signals captured from electrodes placed on the body. They are foundational for training and validating machine-learning models in cardiology, remote patient monitoring, and wearable health technology. A typical ECG dataset spans multiple acquisition formats: standard resting 12-lead ECGs, single-lead and reduced-lead recordings from smartwatches and patch monitors, continuous ambulatory Holter recordings lasting 24 to 48 hours, and short rhythm strips captured during symptomatic episodes.

Signals are commonly stored as WFDB, DICOM-ECG, HL7 aECG, or CSV waveform files, sampled at 250 to 1000 Hz, and accompanied by lead configuration, sampling rate, and calibration metadata. Clinically meaningful ECG datasets include expert annotations of the P wave, QRS complex, and T wave, along with interval measurements such as PR, QRS duration, QT, and corrected QT (QTc). Label schemas often cover normal sinus rhythm and a wide range of abnormalities: atrial fibrillation and flutter, premature atrial and ventricular contractions, supraventricular and ventricular tachycardia, first-, second-, and third-degree AV block, bundle branch blocks, ST-segment elevation and depression, T-wave inversion, and signs of myocardial ischemia or infarction.

High-quality cohorts are demographically balanced across age, sex, and comorbidities, and are de-identified to remove protected health information while preserving diagnostic fidelity. On GetDATA, researchers and medtech companies post requests describing the exact modality, lead set, sampling rate, label taxonomy, class balance, and minimum sample counts they need, and verified hospitals and labs fulfill those requests with compliant, quality-scored electrocardiogram data. Whether you are building arrhythmia classifiers, QT-prolongation screening tools, or wearable rhythm-detection algorithms, sourcing well-annotated ECG datasets is the difference between a model that generalizes and one that fails silently in production.

Common benchmarks and standards include the AAMI EC57 protocol, the PhysioNet/Computing in Cardiology challenges, and SNOMED-coded diagnostic statements, which help align labels across institutions and make datasets interoperable for federated training. Browse the open electrocardiogram data requests below, or explore related cardiac and imaging categories.

Open ECG requests

50,000 Single-Lead Wearable ECG Strips for Large-Scale Atrial Fibrillation Population Screening

Open

Consumer and clinical-grade wearable devices (smartwatches, chest patches, and handheld recorders) are increasingly used for opportunistic AF screening in primary care and community settings. However, models trained on clinical 12-lead ECGs perform poorly on single-lead data because of electrode placement variability, motion artefact, and the absence of spatial voltage information. We are developing a dedicated single-lead AF detection model targeting deployment in FDA Class II-cleared wearable devices. We require 50,000 single-lead ECG recordings, each equivalent to Lead I or a modified limb-lead configuration, with recording durations of 30 seconds to 5 minutes per strip. Minimum sampling rate is 200 Hz; 256 Hz or 300 Hz (typical of consumer optical-to-electrical biosignal chips) is preferred. The minimum amplitude resolution is 8-bit; 12-bit is preferred. Preferred formats are CSV (column-per-channel with ISO 8601 timestamp) or JSON with signal array, sample-rate field, and metadata object. Data may originate from any cleared handheld, wrist-worn, or patch single-lead recorder (AliveCor KardiaMobile, Withings ScanWatch, Zio patch, or equivalent clinical Holter export truncated to single channel). Each strip must carry one of three rhythm labels: AF confirmed, AF not present, or technically inadequate (excessive artefact). Labels must be generated by a certified cardiac physiologist or electrophysiologist, not by the device's own algorithm, to avoid label noise from the very systems our model aims to replace. The labeling protocol requires human expert review using a validated browser-based or desktop annotation platform displaying the raw waveform; annotators must be blinded to the device's automatic interpretation. A minimum of 5% of all strips must undergo dual independent annotation for inter-rater reliability assessment; Cohen's kappa for the AF-confirmed versus AF-not-present binary decision must be ≥0.80.
Strips flagged as technically inadequate must also be reviewed by a second annotator before final labeling, as false inadequacy labeling artificially inflates the rejection rate and degrades training signal. Strips with significant baseline wander, muscle artefact, or lead-off events are valuable as hard negatives and should be labelled as technically inadequate rather than discarded. Because wearable recordings are inherently susceptible to high-frequency motion noise during physical activity, recordings captured during walking, stair climbing, or light exercise (documented by device accelerometer data if available) are specifically solicited to build robustness at inference time. Rhythm and waveform characteristics such as irregular RR intervals, absent P-waves, fibrillatory baseline, and variable QRS amplitude (the hallmarks of AF in single-lead traces) should be used as secondary annotation cues and documented in per-strip quality notes. De-identification must comply with HIPAA Safe Harbour or GDPR Article 89 pseudonymisation. All device-embedded PHI (patient name, date of birth, device serial number traceable to a named individual) must be removed or replaced with surrogate identifiers before delivery. Recordings must not include GPS coordinates or location data, even in embedded metadata fields. Subject-level metadata should include age, sex, BMI, and known AF history (paroxysmal, persistent, permanent, or no known AF), as these features will be used as auxiliary inputs to the model. Atrial fibrillation subtype (paroxysmal versus persistent versus permanent) must be documented where known, as paroxysmal AF episodes captured mid-episode represent the highest clinical value and are the most challenging to detect. We anticipate an AF prevalence of 15–25% in the supplied dataset, reflecting a screening-enriched population rather than a general community sample.
Downstream use cases include a consumer AF detection app embedded in a cleared smartwatch, a primary-care nurse-administered screening kiosk, population-level epidemiological AF prevalence tracking via wearable aggregation, and federated model training across wearable device manufacturer partnerships without centralising raw patient data.
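As an illustration of the technical floor described in this request, the following minimal sketch checks one JSON strip before submission. The field names (sample_rate_hz, adc_bits, signal, label, metadata), the label strings, and the location key names are hypothetical; the request does not define a schema, only the thresholds (200 Hz, 8-bit, 30 seconds to 5 minutes, three rhythm labels, no location data).

```python
# Hypothetical label strings for the three rhythm labels named in the request.
ALLOWED_LABELS = {"af_confirmed", "af_not_present", "technically_inadequate"}
# Hypothetical metadata keys that would indicate location data.
LOCATION_KEYS = {"gps", "latitude", "longitude"}


def validate_strip(record):
    """Return a list of problems found in one strip; an empty list means it passes."""
    problems = []

    fs = record.get("sample_rate_hz")
    if not (isinstance(fs, (int, float)) and fs >= 200):
        problems.append("sample rate missing or below 200 Hz")

    signal = record.get("signal") or []
    if not signal:
        problems.append("signal missing or empty")
    elif isinstance(fs, (int, float)) and fs > 0:
        # Duration follows from sample count and sampling rate.
        duration_s = len(signal) / fs
        if not 30 <= duration_s <= 300:
            problems.append(f"duration {duration_s:.1f} s outside 30 s to 5 min")

    bits = record.get("adc_bits")
    if not isinstance(bits, int) or bits < 8:
        problems.append("amplitude resolution missing or below 8-bit")

    if record.get("label") not in ALLOWED_LABELS:
        problems.append("rhythm label missing or not in the allowed set")

    metadata = record.get("metadata") or {}
    if LOCATION_KEYS & set(metadata):
        problems.append("location fields present in metadata")

    return problems
```

A strip sampled at 256 Hz with 30 seconds of samples, 12-bit resolution, and an allowed label returns an empty list; each violated requirement adds one message, so a supplier can report every problem in a single pass.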

Sensor / device data | ECG | CSV | JSON
0 / 50000 scans (0%)

6,000 Pediatric 12-Lead ECGs Across Age Groups from Neonates to Adolescents with Diagnostic Labels

Open

Pediatric cardiology is a critically underserved domain in AI-driven ECG interpretation because pediatric ECG morphology differs substantially from adult norms: higher resting heart rates, right-ventricular dominance in neonates, evolving QRS axis, and age-specific QTc reference ranges all mean that models trained on adult datasets perform poorly in children. We are building the first large-scale, age-stratified pediatric ECG AI classifier to screen for congenital heart disease, inherited channelopathies, and acquired conditions including Kawasaki disease and myocarditis. We require 6,000 resting 12-lead ECG recordings from patients aged 0–17 years, with the following minimum stratification: neonates and infants (0–12 months) 1,000 recordings, toddlers and pre-school (1–5 years) 1,000 recordings, school-age (6–12 years) 2,000 recordings, and adolescents (13–17 years) 2,000 recordings. Sampling rate must be ≥500 Hz; paper-speed equivalent of 25 mm/s and gain of 10 mm/mV must be documented. EDF or WFDB formats are required. Recording duration ≥10 seconds; longer strips preferred for rhythm assessment. Each record must include cardiologist diagnostic labels from the following categories: normal for age, right bundle branch block, left ventricular hypertrophy, Wolff-Parkinson-White pattern, long-QT syndrome, supraventricular tachycardia, complete AV block, and congenital heart disease (specifying anatomy where known). Reports or structured cardiology findings summaries should accompany records where available, as these provide essential contextual supervision signal. Labeling must be carried out under the responsibility of board-certified pediatric cardiologists. Primary interpretation may be performed by a pediatric cardiology fellow or an attending with electrophysiology training; all abnormal findings must be over-read and confirmed by a senior board-certified pediatric cardiologist.
Diagnostic criteria must be referenced to published age-normative tables (Davignon, Rijnbeek, or equivalent peer-reviewed pediatric ECG reference ranges) because QRS duration, QTc limits, and R-wave amplitude thresholds differ substantially between age groups. Inter-annotator agreement must be assessed for a minimum 10% random subsample, with Cohen's kappa reported per diagnostic category and documented in the data release. Acquisition parameters must be fully documented per record: device manufacturer and model, paper speed setting, gain setting (typically 10 mm/mV for standard leads, 5 mm/mV for high-amplitude neonatal tracings), electrode placement protocol (standard limb positions or pediatric chest electrode spacing), and patient cooperation level (resting/awake, sleeping, or crying, since motion artefact in infants is a major confound). QTc values must be calculated using the Bazett correction for comparison with age-normative ranges, with the raw QT and preceding RR interval also provided. De-identification is strictly required under HIPAA or equivalent national regulation. Because pediatric patients are a protected class, particular care must be taken to remove any free-text that could identify the child or parent. Only age in months for those under two years, or age in completed years, sex assigned at birth, body weight percentile, and relevant metabolic or genetic screening results (e.g., channelopathy gene panel result if available) should be retained as metadata. We strongly encourage participation from tertiary pediatric cardiac centres, as these institutions concentrate the rare diagnoses most valuable to the classifier.
Downstream use cases include deployment as a screening decision-support tool in general pediatric clinics and neonatal intensive care units, integration with wearable infant cardiac monitors, and a federated learning study across multiple pediatric hospitals to address data rarity.
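The age stratification in this request can be tracked with a small quota check. The sketch below is illustrative only: the stratum names and function names are hypothetical, and it assumes that a child under 12 months counts toward the first stratum while the remaining strata are assigned by completed years, which the request does not spell out at the 12-month boundary.

```python
from collections import Counter
from typing import Dict, Iterable, Optional

# Minimum recordings per stratum, taken from the request text.
MINIMUMS = {
    "0-12 months": 1000,
    "1-5 years": 1000,
    "6-12 years": 2000,
    "13-17 years": 2000,
}


def stratum(age_months: int) -> Optional[str]:
    """Map an age in months to a stratum name, or None if outside 0-17 years."""
    if age_months < 0:
        return None
    if age_months < 12:  # assumption: exactly 12 months falls in the 1-5 year stratum
        return "0-12 months"
    years = age_months // 12  # completed years
    if years <= 5:
        return "1-5 years"
    if years <= 12:
        return "6-12 years"
    if years <= 17:
        return "13-17 years"
    return None


def shortfall(ages_months: Iterable[int]) -> Dict[str, int]:
    """Return how many more recordings each stratum still needs."""
    counts = Counter(s for s in map(stratum, ages_months) if s is not None)
    return {name: max(0, need - counts[name]) for name, need in MINIMUMS.items()}
```

Calling shortfall on the ages of the recordings collected so far gives the remaining gap per stratum, which makes an under-filled neonatal or toddler group visible before delivery.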

Sensor / device data | ECG | EDF | WFDB
0 / 6000 scans (0%)

5,000 Serial 12-Lead ECGs for QT-Interval Prolongation and Drug-Induced Arrhythmia Safety Monitoring

Open

A pharmaceutical research organisation is compiling a reference dataset to train and benchmark automated QTc-interval measurement algorithms intended for use in ICH E14-compliant thorough QT studies and ongoing cardiac safety surveillance during drug development. We require serial 12-lead ECG recordings from adult patients or healthy volunteers collected under controlled, medically supervised conditions. Each subject should contribute at least three recordings at defined time points (pre-dose baseline, peak plasma concentration, and ≥4-hour post-dose or equivalent); paired time-point recordings are essential for QT correction modelling. A sampling rate of ≥1000 Hz is preferred to support accurate automated beat detection and interval measurement; 500 Hz is the minimum acceptable threshold. Amplitude resolution must be ≥1 μV. Files must be provided in EDF or CSV format with explicit timestamp alignment between recordings from the same subject. Lead II and the precordial leads (V1–V6) are the primary measurement channels. For each recording we require: automated and over-read cardiologist QTc measurements (Bazett and Fridericia correction), individual beat-level RR intervals and QT intervals (minimum 10 beats averaged), morphology flags for T-wave alternans, U-wave presence, and bifid T-wave, and an overall interpretive statement. Keypoint annotations for P-wave onset, QRS onset, and T-wave offset (tangent method) on Lead II are mandatory for algorithm benchmarking. The labeling protocol must comply with ISCE/ISHNE and ICH E14 guidance on ECG interval measurement in drug studies. Primary QT and QTc measurements must be performed by a trained ECG reader using a validated digital caliper tool; over-read must be performed by a board-certified cardiologist with a clinical pharmacology or cardiac electrophysiology subspecialty.
For each recording, at least 10 consecutive sinus beats must be individually measured and averaged; ectopic beats, paced beats, and beats following a pause must be excluded from the average. Inter-reader variability for QTc measurement must be ≤5 ms mean absolute difference across a randomly sampled 10% re-annotation subset; this metric must be reported in the dataset release documentation. De-identification must satisfy HIPAA Safe Harbour or equivalent GDPR pseudonymisation. Subject-level metadata must include age, sex, BMI, serum electrolyte values (potassium, magnesium, calcium) at time of recording, concomitant medication list at the drug class level, and heart rate at each time point. Any clinical-trial identifiers or site codes must be replaced with anonymised surrogate codes. Data originating from Phase I healthy-volunteer studies are particularly valuable as they represent a clean baseline population; data from patients with prolonged QTc at baseline (>450 ms men, >470 ms women) are equally important as high-sensitivity challenge cases. Data from patients currently receiving QT-prolonging agents (antiarrhythmics, antipsychotics, certain antibiotics) are especially valuable, as are recordings from subjects with known long-QT syndrome (congenital or acquired). Downstream use cases include regulatory-grade central ECG laboratory software for Phase I–III clinical trials, a precision-medicine tool stratifying drug candidates by proarrhythmic risk, and a QTc-monitoring dashboard embedded in hospital pharmacy systems to flag high-risk drug combinations in real time.
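For reference, the two heart-rate corrections named in this request are Bazett (QT divided by the square root of RR in seconds) and Fridericia (QT divided by the cube root of RR in seconds). The sketch below applies them per beat and averages the results, and also computes the mean absolute difference used for the inter-reader limit of 5 ms. Function names are hypothetical, and correcting each beat before averaging (rather than averaging QT and RR first) is an assumption; the request does not specify the order.

```python
from statistics import mean


def qtc_bazett(qt_ms, rr_ms):
    """Bazett correction: QT / sqrt(RR), with RR converted to seconds."""
    return qt_ms / (rr_ms / 1000.0) ** 0.5


def qtc_fridericia(qt_ms, rr_ms):
    """Fridericia correction: QT / cube root of RR, with RR converted to seconds."""
    return qt_ms / (rr_ms / 1000.0) ** (1.0 / 3.0)


def averaged_qtc(beats, min_beats=10):
    """Average per-beat QTc over accepted sinus beats.

    beats is a list of (qt_ms, preceding_rr_ms) pairs from which ectopic beats,
    paced beats, and beats following a pause have already been excluded.
    """
    if len(beats) < min_beats:
        raise ValueError(f"need at least {min_beats} beats, got {len(beats)}")
    return {
        "qt_ms": mean(qt for qt, _ in beats),
        "rr_ms": mean(rr for _, rr in beats),
        "qtc_bazett_ms": mean(qtc_bazett(qt, rr) for qt, rr in beats),
        "qtc_fridericia_ms": mean(qtc_fridericia(qt, rr) for qt, rr in beats),
    }


def mean_abs_diff(reader_a_ms, reader_b_ms):
    """Mean absolute difference between two readers' QTc values, in ms."""
    if len(reader_a_ms) != len(reader_b_ms) or not reader_a_ms:
        raise ValueError("readers must supply the same non-zero number of values")
    return mean(abs(a - b) for a, b in zip(reader_a_ms, reader_b_ms))
```

At an RR interval of exactly 1000 ms both corrections leave QT unchanged, which is a quick sanity check on units: passing RR in seconds instead of milliseconds would inflate the corrected value dramatically.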

Sensor / device data | ECG | CSV | EDF
0 / 5000 scans (0%)

15,000 12-Lead ECGs with STEMI and NSTEMI Labels for Acute MI Detection AI

Open

We are building a real-time, point-of-care AI system to detect ST-elevation myocardial infarction (STEMI) and non-ST-elevation myocardial infarction (NSTEMI) from the initial 12-lead ECG acquired in the emergency department. Early automated flagging of STEMI is a critical bottleneck in door-to-balloon time, and our system targets integration with existing ECG cart software to produce an alert within 10 seconds of acquisition completion. We require 15,000 12-lead ECG recordings at ≥500 Hz sampling rate, ≥12-bit amplitude resolution, with a minimum recording duration of 10 seconds. Files must be provided in WFDB or EDF format; machine-readable XML exports from GE MUSE or Philips TraceMaster systems accompanied by raw voltage data are also acceptable. The case mix must include: confirmed STEMI (by culprit-artery territory — anterior, inferior, lateral, posterior), confirmed NSTEMI, unstable angina with ECG changes, and normal/benign controls. Confirmed diagnoses must be backed by troponin results and, where available, catheterisation or echocardiography findings referenced in an accompanying clinical summary. Annotation requirements per record: ST-segment elevation or depression measurements (in mV, per lead), territory classification, cardiologist final diagnosis, and Killip class if available. Keypoint annotations for QRS onset, J-point, and ST-segment measurement point at 60–80 ms post-J-point are required for the training of measurement regression heads. The annotation labeling protocol requires a minimum of two independent cardiologist readers per record. Initial annotations are produced by an interventional cardiologist or senior cardiology registrar; a second interventional cardiologist performs blind over-read. In cases of STEMI/NSTEMI disagreement, a third senior cardiologist adjudicates. 
ST-elevation thresholds must follow the 2018 ESC/ACC/AHA/WHF Fourth Universal Definition of Myocardial Infarction criteria: new ST elevation at the J-point of ≥1 mm in two contiguous leads other than V2–V3, and in leads V2–V3 ≥2 mm in men aged 40 or older, ≥2.5 mm in men under 40, and ≥1.5 mm in women; a new left bundle branch block pattern with hemodynamic compromise also qualifies. Inter-reader kappa for the STEMI binary label must be ≥0.85 across the contributed subset. De-identification is mandatory: all HIPAA-specified PHI must be removed from DICOM-ECG or MUSE XML headers, including patient ID, name, date of birth, and institution identifiers, and admission dates must be shifted to a relative offset. Only age, sex, cardiovascular risk factors (smoking status, hypertension, dyslipidaemia, diabetes, prior MI or PCI), symptom-onset-to-ECG time in minutes, and peak troponin value (categorical: negative, mildly elevated, markedly elevated) may be retained. We are particularly interested in recordings obtained within the first two hours of symptom onset, as early MI ECGs are substantially underrepresented in public datasets such as PTB-XL and MIMIC-IV-ECG. Downstream use cases include integration into emergency department triage workflows as a clinical decision support tool, training a multi-label classifier capable of simultaneously identifying STEMI territory, Wellens syndrome, and de Winter T-wave patterns, and serving as a benchmark dataset for regulatory submissions to the FDA under 510(k) substantial equivalence evaluation for AI-based ECG analysis software.
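The inter-reader kappa threshold above can be computed directly from the two readers' labels. The following is a minimal sketch of Cohen's kappa, (observed agreement minus chance agreement) divided by (1 minus chance agreement), written in plain Python; the function name is hypothetical, and returning 1.0 when both readers use one identical label throughout is a convention chosen here, not something the request defines.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two readers who labelled the same records in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both readers must label the same non-empty set of records")
    n = len(labels_a)

    # Observed agreement: fraction of records with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: product of each reader's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if expected == 1.0:
        # Degenerate case: both readers used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For the STEMI binary label, labels_a and labels_b would hold each reader's STEMI / not-STEMI call per record, and the result would be compared against 0.85.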

Sensor / device data | ECG | CSV | EDF | WFDB
0 / 15000 scans (0%)

8,000 Holter 24-Hour Ambulatory ECG Recordings for Arrhythmia Burden Quantification

Open

Our research group is developing an automated arrhythmia-burden analysis pipeline for long-duration ambulatory ECG data. We require a dataset of continuous 24-hour (or longer) Holter recordings collected from adult patients referred for ambulatory cardiac monitoring, covering a broad spectrum of rhythm disturbances including paroxysmal atrial fibrillation, premature ventricular contractions (PVCs), supraventricular ectopy, second- and third-degree AV block, and ventricular tachycardia runs. Technical requirements: recordings must include at minimum two channels (Lead II and a modified V5 derivation); three-channel recordings are preferred. Sampling rate must be ≥200 Hz with amplitude resolution ≥2.5 μV. The preferred file format is EDF or WFDB multi-segment; raw binary exports accompanied by a full header describing gain, offset, and channel labels are acceptable. Beat-level annotations following the AAMI EC57 annotation scheme (N, S, V, F, Q classes) produced by a certified cardiac physiologist and confirmed by a supervising cardiologist are mandatory. Episode-level labels indicating AF burden (percentage of recording time in AF), total PVC count, longest VT run duration, and overall arrhythmia classification are also required. The annotation workflow must enforce dual-reader review: an initial beat-by-beat annotation generated by a validated automated Holter analysis system (GE MARS, Spacelabs Oxford, or equivalent) must be manually reviewed and corrected by a credentialed cardiac physiologist, with a supervising electrophysiologist adjudicating all rhythm episodes exceeding 30 seconds. Inter-annotator reliability metrics, including percentage agreement for N, V, and S class beats across a 5% random re-annotation subset, must be reported and provided alongside the dataset. Label taxonomy must align with AAMI EC57 and EC38 standards to ensure compatibility with benchmark evaluations. De-identification must comply with applicable HIPAA or GDPR requirements. 
Patient age expressed as completed years at time of recording, sex, body-mass index, primary indication for monitoring (palpitations, syncope, breathlessness, post-ablation follow-up, or hypertension surveillance), and structural heart disease status must be preserved as structured metadata. Free-text diary entries or patient event logs must be reviewed and redacted before delivery; only event timing and general symptom category (palpitation, dizziness, presyncope, chest pain) should be retained. QA exclusion criteria: any 24-hour recording with more than 2 hours of uninterpretable signal due to electrode detachment or severe motion artefact must be flagged; recordings with total annotatable signal below 18 hours are excluded from the primary count but may be included as a supplementary low-quality subset. We have a strong preference for recordings that include patient-activated event markers aligned to symptoms, as these allow supervised training of symptom-correlated arrhythmia models. Institutions contributing ≥500 recordings with complete beat-level annotation will receive priority payment processing. The target use case is a commercial-grade Holter analysis SaaS product currently in FDA Breakthrough Device evaluation, with a secondary research application targeting AF-burden-guided anticoagulation decision support integrated into cardiology electronic health record systems.
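Two of the quantities in this request reduce to simple arithmetic: AF burden as a percentage of recording time, and the QA rule based on hours of uninterpretable signal. The sketch below shows both under stated assumptions: function names and status strings are hypothetical, AF episodes are given as non-overlapping (start, end) pairs in seconds, and annotatable signal is taken to be total recording time minus uninterpretable time.

```python
HOUR_S = 3600


def af_burden_percent(af_episodes, recording_s):
    """Percentage of recording time spent in AF.

    af_episodes is a list of non-overlapping (start_s, end_s) pairs.
    """
    if recording_s <= 0:
        raise ValueError("recording duration must be positive")
    total_af_s = sum(end - start for start, end in af_episodes)
    return 100.0 * total_af_s / recording_s


def qa_status(recording_s, uninterpretable_s):
    """Classify a recording against the QA exclusion criteria in the request."""
    annotatable_s = recording_s - uninterpretable_s
    if annotatable_s < 18 * HOUR_S:
        return "supplementary"  # excluded from the primary count
    if uninterpretable_s > 2 * HOUR_S:
        return "flagged"  # counted, but must be flagged
    return "primary"
```

For a 24-hour recording, more than 6 hours of uninterpretable signal leaves fewer than 18 annotatable hours, so the recording moves to the supplementary subset; between 2 and 6 hours it stays in the primary count but is flagged.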

Sensor / device data | ECG | EDF | WFDB
0 / 8000 scans (0%)

12,000 Resting 12-Lead ECG Recordings with Expert Atrial Fibrillation Annotations

Open

We are seeking a large, well-annotated dataset of resting 12-lead ECG recordings from adult patients to train and validate a deep-learning classifier for atrial fibrillation (AF) detection. The intended model architecture is a convolutional-recurrent network that operates directly on raw voltage traces, and its performance is highly sensitive to dataset size, annotation quality, and demographic diversity. Each recording must capture all twelve standard leads (I, II, III, aVR, aVL, aVF, V1–V6) at a minimum sampling rate of 500 Hz, with amplitude resolution of at least 1 μV (12-bit ADC or better). Recording duration must be ≥10 seconds per strip; longer 30-second captures are strongly preferred. Accepted file formats are WFDB (PhysioNet/MIT-BIH style header + signal files) or EDF; CSV exports with a standardised column schema are acceptable as a secondary option. Each record must be accompanied by a cardiologist-confirmed rhythm label — at minimum a binary AF / non-AF tag — with additional labels for flutter, supraventricular tachycardia, normal sinus rhythm, and sinus bradycardia strongly preferred. Keypoint annotations marking P-wave onset and offset, QRS complex onset, peak, and offset, and T-wave end are highly desirable for training auxiliary tasks. The labeling protocol must follow a two-stage review: a primary annotation produced by a board-certified cardiologist or credentialed cardiac physiologist, followed by independent over-read by a second annotator; disagreements must be adjudicated by a senior electrophysiologist. Inter-rater agreement (Cohen's kappa) should be reported per rhythm class and included in the dataset documentation. All annotations must use a standardised label taxonomy aligned with the AHA/ACC ECG terminology guidelines to ensure compatibility with publicly available benchmarks such as PhysioNet Challenge datasets and the MIMIC-IV-ECG corpus. 
De-identification must satisfy HIPAA Safe Harbour or an equivalent EU GDPR pseudonymisation standard: no patient name, date of birth, facility name, or accession numbers may appear in signal file headers or companion metadata files. Any free-text physician notes attached to the recording must be scrubbed using a validated PHI-detection NLP pipeline before delivery. Age bucket (decade), biological sex, and comorbidity flags (hypertension, heart failure, diabetes) should be retained as structured metadata fields. Quality exclusion criteria: recordings with any of the following must be flagged or removed: electrode reversal artefact detectable from lead polarity inversion, lead-off noise affecting more than two contiguous leads, baseline wander exceeding 0.5 mV peak-to-peak, or signal clipping. We require an AF-to-non-AF ratio of no worse than 1:3 (at least one AF recording for every three non-AF recordings), and we encourage inclusion of paroxysmal AF cases captured during or immediately after an episode, as these are clinically the hardest to classify and most valuable for model generalisation. Demographic balance across age groups (18–40, 41–60, 61–80, >80 years) and sex is mandatory. Downstream use cases include a real-time AF alert integrated into hospital ECG cart software, a cloud-based clinical decision support API, and federated training experiments across multiple institution nodes.
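Two of the quality exclusion criteria above, signal clipping and baseline wander over 0.5 mV peak-to-peak, can be screened for automatically. The sketch below is one possible approach, not a method the request prescribes: the function names, the ADC rail parameters, the three-sample run length, and the one-second moving-average baseline estimate are all assumptions. A moving average is a crude baseline estimator that QRS complexes can still influence, so a production pipeline would likely use a dedicated high-pass or median filter.

```python
def has_clipping(samples_mv, rail_low_mv, rail_high_mv, min_run=3):
    """True if at least min_run consecutive samples sit at or beyond an ADC rail."""
    run = 0
    for x in samples_mv:
        if x <= rail_low_mv or x >= rail_high_mv:
            run += 1
            if run >= min_run:
                return True
        else:
            run = 0
    return False


def baseline_wander_p2p(samples_mv, fs_hz, window_s=1.0):
    """Peak-to-peak range (mV) of a moving-average baseline estimate for one lead."""
    w = max(1, int(round(window_s * fs_hz)))
    if len(samples_mv) < w:
        raise ValueError("signal shorter than the averaging window")
    # Running sum keeps the moving average linear in the signal length.
    window_sum = sum(samples_mv[:w])
    lo = hi = window_sum / w
    for i in range(w, len(samples_mv)):
        window_sum += samples_mv[i] - samples_mv[i - w]
        avg = window_sum / w
        lo = min(lo, avg)
        hi = max(hi, avg)
    return hi - lo


def quality_flags(samples_mv, fs_hz, rail_low_mv, rail_high_mv):
    """Return the set of quality flags raised for one lead."""
    flags = set()
    if has_clipping(samples_mv, rail_low_mv, rail_high_mv):
        flags.add("clipping")
    if baseline_wander_p2p(samples_mv, fs_hz) > 0.5:
        flags.add("baseline_wander")
    return flags
```

Running quality_flags on each of the twelve leads gives a per-lead set of flags that can be recorded in the companion metadata, so flagged recordings remain traceable rather than silently dropped.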

Sensor / device data | ECG | EDF | WFDB
0 / 12000 scans (0%)

Related categories