Pulmonary Nodule and Lung Mass Dataset with Radiologist Bounding Box Annotations — 15,000 Images

Open

Overview

We are a digital health startup developing a pulmonary nodule detection and malignancy risk-stratification model for integration into PACS worklist prioritization. We require chest radiographs with radiologist-annotated bounding boxes around pulmonary nodules and masses, covering a range of sizes, densities, and anatomical locations.

Technical specifications: posteroanterior (PA) chest radiographs are the primary acquisition type required; lateral views for the same patient encounter are requested as supplementary data where available. Images must be supplied in DICOM format at full acquisition resolution (minimum 2048 x 2048 pixels, 14-bit depth). Bounding box annotations must be provided in JSON format, with each annotation record containing: bounding box coordinates (x, y, width, height in pixels), nodule diameter estimate in millimeters, location descriptor (upper/middle/lower zone, left/right), density category (solid, ground-glass, part-solid), and Fleischner Society size category. Cases must include confirmed nodules measuring 6 mm or larger. Images with no detectable nodules should constitute 40–50% of the dataset to ensure realistic negative sampling.

Clinical label requirements: each nodule case should carry an image-level label indicating benign, malignant, or indeterminate based on available follow-up imaging, biopsy, or multidisciplinary tumor board (MDT) consensus. If histopathological confirmation is available, this should be recorded in the JSON metadata. Annotations must originate from radiologists with subspecialty thoracic experience. AI pre-annotation is acceptable provided each bounding box was reviewed and approved by a radiologist.

Acquisition and QA criteria: radiographs must meet a minimum image quality standard: adequate lung expansion (at least 8–10 posterior ribs visible), no severe rotation (the spinous processes should be equidistant from the medial clavicular ends), and no motion artifact degrading nodule visualization. Images acquired at 100–125 kVp with appropriate mAs are preferred. Scanner vendor metadata should be retained and diverse representation from multiple manufacturers including GE Healthcare, Siemens Healthineers, Fujifilm, and Philips is requested. Pediatric cases (under 18 years) should be excluded unless specifically from a TB-endemic population study; the target demographic is adults aged 40–80 with a smoking history or incidental nodule finding. Cases with prior lobectomy, pneumonectomy, or extensive pleural plaques overlying the nodule region should be excluded or flagged.

De-identification and regulatory compliance: all DICOM headers must be de-identified per HIPAA Safe Harbor or equivalent standard, removing patient name, date of birth, institution name, and all date fields. Burned-in text overlays in the image pixel data (such as institution watermarks, patient demographic banners, or laterality labels embedded at acquisition) must be removed without cropping any portion of the lung parenchyma. Compliance documentation including IRB approval number and de-identification method attestation must accompany each contributing institution's data submission.

Primary use cases: this dataset will train a nodule detection model targeting sensitivity of 90% or higher at clinically relevant specificity, for integration as a second-reader CAD tool. The model is intended for deployment in community radiology settings without subspecialty thoracic coverage. Data will not be shared beyond the contracted research team without explicit re-consent from the contributing institution.
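
For contributors preparing the JSON annotations, the record contents listed under the technical specifications can be checked before submission. The following is a minimal sketch and not an official schema: the listing names what each record must contain but does not define JSON keys, so every key name here (bbox, diameter_mm, zone, side, density, fleischner_size_category), the example values, and the validate_annotation helper are assumptions made for illustration.

```python
import json

# Hypothetical key names: the listing specifies the contents of a record,
# not the JSON keys, so these are illustrative only.
REQUIRED_FIELDS = {
    "bbox": dict,
    "diameter_mm": (int, float),
    "zone": str,
    "side": str,
    "density": str,
    "fleischner_size_category": str,
}
ZONES = {"upper", "middle", "lower"}
SIDES = {"left", "right"}
DENSITIES = {"solid", "ground-glass", "part-solid"}
MIN_DIAMETER_MM = 6.0  # the listing asks for nodules of 6 mm or larger


def validate_annotation(record, image_width, image_height):
    """Return a list of problems found in one annotation record (empty if none)."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], expected):
            problems.append(f"wrong type for field: {name}")
    if problems:
        return problems  # the checks below assume every field is present and typed

    box = record["bbox"]
    coords = [box.get(key) for key in ("x", "y", "width", "height")]
    if not all(isinstance(value, (int, float)) for value in coords):
        problems.append("bbox needs numeric x, y, width, height")
    else:
        x, y, width, height = coords
        if width <= 0 or height <= 0:
            problems.append("bbox width and height must be positive")
        if x < 0 or y < 0 or x + width > image_width or y + height > image_height:
            problems.append("bbox extends outside the image")

    if record["diameter_mm"] < MIN_DIAMETER_MM:
        problems.append("nodule diameter below 6 mm")
    if record["zone"] not in ZONES:
        problems.append(f"unknown zone: {record['zone']}")
    if record["side"] not in SIDES:
        problems.append(f"unknown side: {record['side']}")
    if record["density"] not in DENSITIES:
        problems.append(f"unknown density category: {record['density']}")
    return problems


# Example record with made-up values, checked against a 2048 x 2048 image.
example = json.loads("""
{
  "bbox": {"x": 812, "y": 640, "width": 48, "height": 44},
  "diameter_mm": 9.5,
  "zone": "upper",
  "side": "right",
  "density": "part-solid",
  "fleischner_size_category": ">8 mm"
}
""")
print(validate_annotation(example, 2048, 2048))  # prints []
```

The sketch only checks structure and value ranges. It does not verify the Fleischner Society size category against the diameter, since the listing does not say which edition of the guidelines or which category strings are expected.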

Progress

0 / 15,000 scans (0%)

Data Specifications

Category: Medical imaging
Required quantity: 15,000
Data types: Medical imaging, X-ray, Chest, DICOM, JSON
Budget: USD 75,000.00
Deadline: 2027-02-27
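
Taken together with the 40–50% nodule-free share requested in the Overview, the required quantity implies concrete image counts. The short sketch below works through that arithmetic; the function names are hypothetical, and the per-image figure assumes the budget is spread evenly across all images, which the listing does not state.

```python
REQUIRED_QUANTITY = 15000           # from the data specifications
BUDGET_USD = 75000.00               # from the data specifications
NEGATIVE_SHARE_PERCENT = (40, 50)   # nodule-free share requested in the Overview


def negative_image_range(total, share_percent=NEGATIVE_SHARE_PERCENT):
    """Smallest and largest nodule-free image counts allowed for `total` images."""
    low, high = share_percent
    return total * low // 100, total * high // 100


def composition_ok(total, negatives, share_percent=NEGATIVE_SHARE_PERCENT):
    """True if `negatives` out of `total` images falls inside the requested share."""
    low, high = share_percent
    # Cross-multiplied integer comparison avoids floating-point rounding.
    return low * total <= negatives * 100 <= high * total


neg_low, neg_high = negative_image_range(REQUIRED_QUANTITY)
print(neg_low, neg_high)                                   # 6000 7500
print(REQUIRED_QUANTITY - neg_high, REQUIRED_QUANTITY - neg_low)  # 7500 9000 nodule images
print(BUDGET_USD / REQUIRED_QUANTITY)                      # 5.0 USD per image, if spread evenly
```

The same composition_ok check can be applied to a partial submission, for example composition_ok(2000, 900) for a batch of 2,000 images containing 900 nodule-free ones.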

Use Cases

  • Training and validating medical imaging AI/ML models
  • Benchmarking medical imaging detection and segmentation algorithms
  • Building de-identified medical imaging research datasets for academic studies
  • Augmenting existing medical imaging datasets to reduce class imbalance