High-Volume Tuberculosis Screening Chest X-Ray Dataset — 100,000 Images for Programmatic TB AI
Status: Open
Overview
We are a global health technology organization developing and validating AI-assisted tuberculosis (TB) screening tools for deployment in high-burden, low-resource settings across Sub-Saharan Africa and South-East Asia. We require the largest possible dataset of de-identified chest radiographs with TB-related labels to train models that must generalize across diverse patient populations, scanner types, and acquisition conditions.

Technical specifications: Both posteroanterior (PA) and anteroposterior (AP) projections are accepted, because programmatic TB screening commonly uses portable or mobile X-ray units that produce AP images. Images may be provided in DICOM, PNG, or TIFF format. PNG and TIFF files must be at least 1024x1024 pixels at 8-bit depth; DICOM files should retain native acquisition resolution, which may range from 1500x1500 to 3000x3000 pixels. Images acquired on CR (computed radiography) systems, DR (digital radiography) systems, and digitized analog film are all acceptable and should be tagged by acquisition modality in the accompanying metadata. JSON metadata must record the TB outcome label (positive/negative/indeterminate), the label source (sputum culture confirmation, GeneXpert MTB/RIF molecular assay, smear microscopy, radiologist read, or programmatic classification), treatment status if known, and any prior TB history.

Labeling requirements: Image-level labels are the primary annotation type required. CheXpert-style uncertainty labels (positive/negative/uncertain) are acceptable. Radiological findings associated with active pulmonary tuberculosis (upper lobe consolidation or infiltrate, cavitation, hilar or mediastinal lymphadenopathy, miliary nodular pattern, pleural effusion, and post-primary fibronodular scarring) should be recorded as secondary structured labels in the JSON sidecar file where available from radiologist reads.
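
As an illustration of how a contributor might pre-check a JSON sidecar record against the requirements above, here is a minimal sketch. The listing specifies which concepts must be recorded but not a schema, so every field name (`tb_label`, `label_source`, `projection`, `acquisition_modality`, `file_format`, `width`, `height`, `bit_depth`, `findings`) and every value spelling below is an assumption made for this example. Optional fields such as treatment status and prior TB history are not checked.

```python
# Hypothetical sidecar schema: field names and value spellings are assumptions,
# chosen only to mirror the categories named in the listing.
ALLOWED = {
    "tb_label": {"positive", "negative", "indeterminate"},
    "label_source": {"sputum_culture", "genexpert_mtb_rif", "smear_microscopy",
                     "radiologist_read", "programmatic"},
    "projection": {"PA", "AP"},
    "acquisition_modality": {"CR", "DR", "digitized_film"},
    "file_format": {"DICOM", "PNG", "TIFF"},
}
FINDINGS = {
    "upper_lobe_consolidation", "cavitation", "lymphadenopathy",
    "miliary_pattern", "pleural_effusion", "fibronodular_scarring",
}
MIN_SIDE_PX = 1024   # resolution floor stated for PNG and TIFF
MIN_BIT_DEPTH = 8    # bit-depth floor stated for PNG and TIFF


def validate_sidecar(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    for field, allowed in ALLOWED.items():
        value = record.get(field)
        if not isinstance(value, str) or value not in allowed:
            problems.append(f"{field}: {value!r} is not one of {sorted(allowed)}")

    # The resolution and bit-depth floors apply to PNG/TIFF only;
    # DICOM files keep their native acquisition resolution.
    if record.get("file_format") in {"PNG", "TIFF"}:
        width = record.get("width") or 0
        height = record.get("height") or 0
        if min(width, height) < MIN_SIDE_PX:
            problems.append(f"resolution {width}x{height} is below {MIN_SIDE_PX}x{MIN_SIDE_PX}")
        if (record.get("bit_depth") or 0) < MIN_BIT_DEPTH:
            problems.append(f"bit depth is below {MIN_BIT_DEPTH}")

    # Secondary findings are optional, but any that are present must be recognized.
    unknown = set(record.get("findings", [])) - FINDINGS
    if unknown:
        problems.append(f"unrecognized findings: {sorted(unknown)}")
    return problems
```

A record would be accepted when `validate_sidecar(record)` returns an empty list. Returning a list of problems instead of raising on the first error lets a site fix a whole batch in one pass.
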
Images from HIV-positive patients are particularly valuable and should be flagged with HIV co-infection status in an anonymized binary field (HIV-positive: yes/no) without disclosing any additional identifying information.

Acquisition diversity and QA criteria: Because this dataset targets deployment in low-resource settings, images from a wide range of scanner quality levels are acceptable, including older CR plate systems, mobile digital units, and digitized film. However, images must be of sufficient diagnostic quality for a trained radiologist to render a clinical read. Severely underexposed or overexposed radiographs, images with significant patient motion artifact, and films with physical damage artifacts should be excluded. Metadata must record scanner type, manufacturer, and approximate year of manufacture where available, as subgroup analysis of model performance by acquisition platform is a planned research output. Geographic site metadata (country and WHO TB burden tier) should be included to enable site-level stratification.

De-identification and data governance: All DICOM PHI must be removed per HIPAA Safe Harbor or an equivalent national standard. For African and Asian contributing sites operating under different national data protection frameworks, data sharing agreements must confirm compliance with applicable local regulations (e.g., Kenya Data Protection Act, India Digital Personal Data Protection Act, Indonesia GR No. 71). Burned-in pixel-data annotations, including patient name, hospital identifier, or date of examination embedded at acquisition, must be confirmed absent or redacted before delivery. WHO data governance guidelines for health AI datasets apply.

This dataset will be used to develop WHO-compliant AI screening tools targeting sensitivity of 90% or higher and specificity of 70% or higher for TB triage in programmatic settings. All models trained on this data will be evaluated against a geographically diverse held-out test set.
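
The sensitivity and specificity targets, together with the site-level stratification mentioned above, can be sketched as a per-site check. This is only an illustration under assumptions: the function name `triage_metrics`, the `(site, reference_positive, predicted_positive)` row layout, and the choice to exclude indeterminate cases before calling it are not part of the listing.

```python
from collections import defaultdict

SENSITIVITY_TARGET = 0.90  # from the listing: 90% or higher
SPECIFICITY_TARGET = 0.70  # from the listing: 70% or higher


def triage_metrics(rows):
    """Compute per-site sensitivity and specificity.

    rows: iterable of (site, reference_positive, predicted_positive) tuples,
    where both flags are booleans. Indeterminate reference labels are assumed
    to have been filtered out beforehand.
    """
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for site, reference_positive, predicted_positive in rows:
        if reference_positive:
            key = "tp" if predicted_positive else "fn"
        else:
            key = "fp" if predicted_positive else "tn"
        counts[site][key] += 1

    report = {}
    for site, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["tn"] + c["fp"]
        # A site with no positives (or no negatives) has an undefined metric.
        sensitivity = c["tp"] / positives if positives else None
        specificity = c["tn"] / negatives if negatives else None
        report[site] = {
            "sensitivity": sensitivity,
            "specificity": specificity,
            "meets_targets": (
                sensitivity is not None
                and specificity is not None
                and sensitivity >= SENSITIVITY_TARGET
                and specificity >= SPECIFICITY_TARGET
            ),
        }
    return report
```

Reporting each site separately, instead of pooling all images, keeps a strong result at one large site from masking a shortfall at another.
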
Contributing institutions will receive a license to the trained model for non-commercial programmatic use. Full compliance with applicable national data protection regulations and WHO data governance guidelines is required.
Data Specifications
| Field | Value |
|---|---|
| Category | Medical imaging |
| Required quantity | 100,000 images |
| Data types | Medical imaging, X-ray, Chest, DICOM, JSON, PNG / JPG, TIFF |
| Budget | USD 180,000.00 |
| Deadline | 2027-06-02 |
Use Cases
- Training and validating medical imaging AI/ML models
- Benchmarking medical imaging detection and segmentation algorithms
- Building de-identified medical imaging research datasets for academic studies
- Augmenting existing medical imaging datasets to reduce class imbalance