Multi-Label Thoracic Pathology Dataset with Paired Radiology Reports — 40,000 DICOM Chest X-Rays
Status: Open
We are building the next generation of radiology report generation and multi-label thoracic pathology classification models, requiring a comprehensive chest radiograph dataset with both structured image-level labels and paired de-identified free-text radiology reports. The dataset design is inspired by publicly available benchmarks such as CheXpert and MIMIC-CXR but sourced from European and non-US institutions to ensure geographic and demographic generalizability.
Technical specifications: posteroanterior (PA) radiographs are required as the primary view; lateral views from the same study encounter are strongly requested as a supplementary acquisition. Images must be provided in DICOM format at native acquisition resolution (minimum 2048x2048, 12-bit depth), with DICOM headers de-identified per the DICOM PS 3.15 Annex E Basic Application Level Confidentiality Profile. Each study must include a paired de-identified radiology report containing at minimum the Findings and Impression sections. Report de-identification must remove all direct and quasi-identifiers — patient name, dates of service, referring physician name, hospital name, radiologist name — while preserving all clinical content including anatomical descriptions, measurements, and diagnostic conclusions. Free-text reports must be provided in UTF-8 encoded plain text or structured JSON with clearly demarcated Findings and Impression fields.
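As a rough illustration of the image and report requirements above, the following sketch checks one study record against the stated minimums. The record layout and its field names (`rows`, `columns`, `bits_stored`, `view_position`, `report`, `findings`, `impression`) are assumptions made for this example, not a prescribed submission schema; in practice the image values would be read from the DICOM header.

```python
import json

MIN_SIDE = 2048   # minimum pixels per side, from the specification
MIN_BITS = 12     # minimum bit depth, from the specification
REQUIRED_SECTIONS = ("findings", "impression")  # hypothetical JSON keys


def check_study(record: dict) -> list[str]:
    """Return a list of problems found; an empty list means the record passes."""
    problems = []
    if record.get("rows", 0) < MIN_SIDE or record.get("columns", 0) < MIN_SIDE:
        problems.append("image below 2048x2048")
    if record.get("bits_stored", 0) < MIN_BITS:
        problems.append("bit depth below 12")
    if record.get("view_position") not in ("PA", "LATERAL"):
        problems.append("view is neither PA nor lateral")
    report = record.get("report", {})
    for section in REQUIRED_SECTIONS:
        # A section counts as present only if it has non-whitespace text.
        if not str(report.get(section, "")).strip():
            problems.append(f"report missing {section}")
    return problems


example = json.loads("""
{
  "rows": 2500, "columns": 2048, "bits_stored": 12, "view_position": "PA",
  "report": {"findings": "Lungs are clear.", "impression": ""}
}
""")
print(check_study(example))  # ['report missing impression']
```

Returning a list of problems rather than a single pass/fail flag makes it easier to report every shortfall for a study in one pass.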
Labeling requirements: each image must carry structured labels for the following 14 thoracic observations using CheXpert-style positive/negative/uncertain encoding: Atelectasis, Cardiomegaly, Consolidation, Edema, Enlarged Cardiomediastinum, Fracture, Lung Lesion, Lung Opacity, No Finding, Pleural Effusion, Pleural Other, Pneumonia, Pneumothorax, Support Devices. Labels may be extracted from the de-identified reports by an NLP pipeline following the CheXpert or NegBio labeling methodology, but must be quality-checked through a radiologist audit of a sample covering at least 10% of the labeled dataset. Bounding box annotations for at least Pleural Effusion and Cardiomegaly are desirable as an optional supplementary annotation layer, provided in COCO-format JSON.
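The labeling rules above can be sketched as a small validator. It assumes the numeric convention used in the public CheXpert release (1 = positive, 0 = negative, -1 = uncertain); allowing `None` for an observation the report does not mention is a further assumption, as are the function names.

```python
OBSERVATIONS = [
    "Atelectasis", "Cardiomegaly", "Consolidation", "Edema",
    "Enlarged Cardiomediastinum", "Fracture", "Lung Lesion", "Lung Opacity",
    "No Finding", "Pleural Effusion", "Pleural Other", "Pneumonia",
    "Pneumothorax", "Support Devices",
]

# 1 = positive, 0 = negative, -1 = uncertain (CheXpert convention).
# None = not mentioned in the report (an assumption for this sketch).
ALLOWED_VALUES = {1, 0, -1, None}


def validate_labels(labels: dict) -> list[str]:
    """Check that a label row covers exactly the 14 observations with valid values."""
    errors = []
    for name in OBSERVATIONS:
        if name not in labels:
            errors.append(f"missing label: {name}")
        elif labels[name] not in ALLOWED_VALUES:
            errors.append(f"invalid value for {name}: {labels[name]!r}")
    for name in sorted(set(labels) - set(OBSERVATIONS)):
        errors.append(f"unexpected label: {name}")
    return errors


def audit_sample_size(n_labeled: int, percent: int = 10) -> int:
    """Smallest whole number of studies covering `percent` of the labeled set."""
    return (n_labeled * percent + 99) // 100  # integer ceiling, no float rounding


row = {name: 0 for name in OBSERVATIONS}
row["Pleural Effusion"] = 1
row["Pneumonia"] = -1
print(validate_labels(row))      # []
print(audit_sample_size(40000))  # 4000
```

For the full 40,000-study target, a 10% audit therefore means a radiologist review of at least 4,000 labeled studies.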
Acquisition standards and QA criteria: radiographs must be of diagnostic quality, meaning adequate inspiration, no severe rotation, no collimator cutoff, and no significant motion artifact. Acquisition dates should span a minimum of five calendar years to capture protocol and equipment evolution across contributing sites. At least three distinct radiography system vendors (e.g., Philips, Siemens, Agfa, Canon) should be represented to reduce vendor-specific bias in trained models. Patient demographics should be recorded in aggregate: age distribution (decade bins), sex distribution, and primary clinical indication for the radiograph (routine check-up, cardiac follow-up, respiratory symptoms, pre-operative clearance). Pediatric images (under 18 years) should be flagged and may constitute up to 10% of the dataset.
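A minimal sketch of the aggregate age reporting and the pediatric cap described above might look like this. The function names and the output layout are illustrative assumptions, and the ages are made-up example values.

```python
from collections import Counter

PEDIATRIC_AGE_LIMIT = 18   # under 18 years counts as pediatric
PEDIATRIC_CAP_PERCENT = 10  # pediatric images may be up to 10% of the dataset


def decade_bin(age: int) -> str:
    """Map an age in years to its decade bin, e.g. 34 -> '30-39'."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def summarize_ages(ages: list[int]) -> dict:
    """Aggregate ages into decade bins and check the pediatric share."""
    bins = Counter(decade_bin(a) for a in ages)
    pediatric = sum(1 for a in ages if a < PEDIATRIC_AGE_LIMIT)
    return {
        "bins": dict(bins),
        "pediatric_count": pediatric,
        # Integer comparison avoids floating-point edge cases at exactly 10%.
        "pediatric_within_cap": pediatric * 100 <= PEDIATRIC_CAP_PERCENT * len(ages),
    }


ages = [7, 34, 36, 52, 58, 61, 67, 70, 74, 83]  # made-up example values
summary = summarize_ages(ages)
print(summary["bins"])
print(summary["pediatric_count"], summary["pediatric_within_cap"])
```

Only the binned counts would be shared, which keeps individual ages out of the aggregate demographic record.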
De-identification and compliance: DICOM header PHI must be removed using the Basic Application Level Confidentiality Profile (DICOM PS 3.15 Annex E) or an equivalent validated pipeline. For European sites, GDPR Article 9 applies to health data — a data processing agreement, records of processing activities under Article 30, and ethics committee approval or waiver documentation must be provided. Burned-in pixel annotations (patient name overlays, acquisition date stamps, institution watermarks embedded during image acquisition) must be confirmed absent or redacted using validated pixel-scrubbing software, with a sample audit confirming complete PHI removal.
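To make the sample audit of header de-identification more concrete, here is a hedged sketch that flags attributes still holding a value so a reviewer can inspect them. It is not an implementation of the Annex E profile: the keyword list is a small illustrative subset of standard DICOM attribute keywords, the headers are assumed to be already extracted into plain dictionaries, and a non-empty value may be a legitimate dummy replacement rather than PHI.

```python
# Illustrative subset of DICOM attribute keywords that can carry identifiers.
# The Annex E profile covers many more attributes than these.
REVIEW_KEYWORDS = (
    "PatientName",
    "PatientBirthDate",
    "ReferringPhysicianName",
    "InstitutionName",
    "OperatorsName",
)


def attributes_to_review(header: dict) -> list[str]:
    """Return the listed keywords that still hold a non-empty value."""
    return [k for k in REVIEW_KEYWORDS if str(header.get(k, "")).strip()]


def audit_headers(headers: list[dict]) -> dict:
    """Map the index of each flagged header to the attributes needing review."""
    flagged = {}
    for index, header in enumerate(headers):
        hits = attributes_to_review(header)
        if hits:
            flagged[index] = hits
    return flagged


sample = [  # hypothetical headers for illustration
    {"PatientName": "", "InstitutionName": "", "Modality": "DX"},
    {"PatientName": "ANON", "InstitutionName": "General Hospital", "Modality": "DX"},
]
print(audit_headers(sample))  # {1: ['PatientName', 'InstitutionName']}
```

A check of this kind only covers header attributes; burned-in pixel text still needs the separate pixel-level audit the requirement describes.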
Use cases include multi-label classification model training, automated radiology report generation research, retrieval-augmented diagnostic reasoning systems, and cross-institutional domain adaptation studies. This is an academic research program with planned open publication. Contributing institutions will be acknowledged as dataset partners in resulting publications, and a data consortium agreement will govern shared governance of the aggregate dataset.
Tags: Medical imaging, X-ray, DICOM, JSON
Progress: 0 / 40,000 scans (0%)