50,000 Chest X-Rays with Pneumonia Classification Labels and Pathology Metadata

Open

Overview

We are a pulmonary AI research group developing a deep-learning classifier for community-acquired pneumonia detection using posteroanterior (PA) chest radiographs. We require a large, demographically diverse dataset of de-identified chest X-rays with confirmed image-level pneumonia labels, sourced from emergency department and outpatient radiology workflows.

Technical specifications: PA-view radiographs are strongly preferred, with AP-view images acceptable if clearly marked. Images must be provided in DICOM format, preserving full-resolution acquisition data (minimum 2048x2048 pixels, 12-bit depth). Each image must include DICOM header metadata (acquisition parameters, patient age range, biological sex) after de-identification per HIPAA Safe Harbor or equivalent regulatory standard. Accompanying JSON metadata per file must record the image-level label (pneumonia-positive / pneumonia-negative / indeterminate), label source (radiologist consensus, single radiologist, or structured report extraction), and any co-existing findings such as pleural effusion or pulmonary consolidation.

Labeling requirements follow CheXpert-style conventions: each image-level label must indicate positive, negative, or uncertain for pneumonia. Where available, the de-identified radiology report impression section should be included as free text. Datasets with at least 30% positive cases are preferred to avoid extreme class imbalance. Labels must originate from board-certified radiologists; AI-generated labels are acceptable only as a secondary annotation layer, clearly flagged.

Acquisition and QA criteria: images must pass a minimum quality threshold, meaning no severe motion blur, no collimator cut-off, and an adequate exposure index (EI). Radiographs acquired on digital radiography (DR) systems are preferred, although computed radiography (CR) plate-based images are acceptable. Scanner vendor and model must be recorded in metadata to enable downstream subgroup analysis by acquisition system. Images from at least three distinct scanner manufacturers (e.g., Philips, Siemens, GE Healthcare) are requested to ensure vendor diversity.

Population and exclusions: pediatric studies (patients under 18) must be flagged and may be included as a stratified subset, but adult studies age 18–85 constitute the core population. Exclusion criteria include post-pneumonectomy images, images with burned-in patient annotations that cannot be removed without cropping clinically relevant lung regions, and images where de-identification of DICOM PHI cannot be confirmed.

De-identification compliance: all DICOM files must be de-identified according to HIPAA Safe Harbor (45 CFR § 164.514(b)) or the DICOM PS 3.15 Annex E Basic Profile, removing all 18 PHI categories including patient name, birth date, admission date, and device identifiers. Burned-in text annotations in pixel data (e.g., patient initials, laterality markers embedded via image acquisition) must be verified absent or redacted. GDPR-equivalent standards apply to European-sourced data.

This dataset will be used to train and externally validate pneumonia triage models intended for low-resource clinical settings. Results will be published in peer-reviewed journals. Institutional data sharing agreement and IRB confirmation of waiver or approval will be provided prior to transfer. Hospitals must confirm the absence of re-identification risk under the provided de-identification protocol.
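
As an illustration of the per-file JSON metadata described above, the following Python sketch checks one record against the stated label values, label sources, and required acquisition fields. The request does not define a schema, so every field name here (`label`, `label_source`, `view`, `co_findings`, `scanner_vendor`, `scanner_model`, `pediatric`) is a hypothetical choice, and the example values are made up for demonstration only.

```python
import json

# Allowed values are taken from the request text; the field names that carry
# them are hypothetical, since the request does not define a JSON schema.
ALLOWED_LABELS = {"pneumonia-positive", "pneumonia-negative", "indeterminate"}
ALLOWED_SOURCES = {
    "radiologist consensus",
    "single radiologist",
    "structured report extraction",
}
ALLOWED_VIEWS = {"PA", "AP"}


def validate_record(record):
    """Return a list of problems found in one per-image metadata record."""
    problems = []
    if record.get("label") not in ALLOWED_LABELS:
        problems.append("label is missing or not an allowed value")
    if record.get("label_source") not in ALLOWED_SOURCES:
        problems.append("label_source is missing or not an allowed value")
    if record.get("view") not in ALLOWED_VIEWS:
        problems.append("view must be PA or AP")
    if not isinstance(record.get("co_findings"), list):
        problems.append("co_findings must be a list (it may be empty)")
    for key in ("scanner_vendor", "scanner_model"):
        if not record.get(key):
            problems.append(key + " is required")
    if not isinstance(record.get("pediatric"), bool):
        problems.append("pediatric must be true or false")
    return problems


# Made-up example record, parsed from JSON as a supplier might deliver it.
example = json.loads("""
{
  "label": "pneumonia-positive",
  "label_source": "radiologist consensus",
  "view": "PA",
  "co_findings": ["pleural effusion"],
  "scanner_vendor": "ExampleVendor",
  "scanner_model": "ExampleModel",
  "pediatric": false
}
""")

print(validate_record(example))  # an empty list means no problems were found
```

A check of this kind could run on each sidecar file before transfer; it does not inspect the DICOM headers or pixel data, which would need separate tooling.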

Medical imaging, X-ray, Chest, DICOM, JSON

Progress

0 / 50000 scans (0%)

Data Specifications

Category: Medical imaging
Required quantity: 50000
Data types: Medical imaging, X-ray, Chest, DICOM, JSON
Budget: USD 95000.00
Deadline: 2026-11-29
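
A submitted batch can be compared against the dataset-level targets (50000 scans, at least 30% positive cases preferred, at least three scanner manufacturers requested) with a short summary script. The sketch below assumes each image has a metadata dict with hypothetical keys `label` and `scanner_vendor`; the request does not define these names, and the toy records are made up.

```python
from collections import Counter

REQUIRED_QUANTITY = 50000     # required quantity from the specifications
MIN_POSITIVE_FRACTION = 0.30  # preferred minimum share of positive cases
MIN_MANUFACTURERS = 3         # requested number of distinct manufacturers


def summarize(records):
    """Summarize per-image metadata dicts against the stated targets.

    The keys "label" and "scanner_vendor" are assumed for illustration.
    """
    total = len(records)
    labels = Counter(r["label"] for r in records)
    vendors = {r["scanner_vendor"] for r in records}
    positive_fraction = labels["pneumonia-positive"] / total if total else 0.0
    return {
        "total": total,
        "positive_fraction": positive_fraction,
        "manufacturers": len(vendors),
        "meets_quantity": total >= REQUIRED_QUANTITY,
        "meets_positive_fraction": positive_fraction >= MIN_POSITIVE_FRACTION,
        "meets_vendor_diversity": len(vendors) >= MIN_MANUFACTURERS,
    }


# Toy batch: 4 positive and 6 negative records spread over three vendors.
toy = (
    [{"label": "pneumonia-positive", "scanner_vendor": "A"}] * 4
    + [{"label": "pneumonia-negative", "scanner_vendor": "B"}] * 3
    + [{"label": "pneumonia-negative", "scanner_vendor": "C"}] * 3
)
print(summarize(toy))
```

Indeterminate labels count toward the total but not toward the positive fraction in this sketch; whether that matches the requester's intent is an assumption.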

Use Cases

  • Training and validating medical imaging AI/ML models
  • Benchmarking medical imaging detection and segmentation algorithms
  • Building de-identified medical imaging research datasets for academic studies
  • Augmenting existing medical imaging datasets to reduce class imbalance