Pathology

Pathology WSI Datasets — Whole Slide Imaging Data

Datasets for computational and digital pathology center on whole-slide images (WSIs): gigapixel digitizations of glass histology slides, produced by slide scanners from vendors such as Aperio, Hamamatsu, and Leica at 20x or 40x objective magnification. They are foundational for training and validating machine-learning models in oncology, biomarker discovery, and computer-aided diagnosis, because the histopathology slide remains the diagnostic gold standard for cancer and many other diseases. A single whole-slide image can contain billions of pixels, so it is stored in pyramidal, tiled formats such as SVS, NDPI, TIFF and BigTIFF, and increasingly DICOM WSI. Embedded metadata typically records microns-per-pixel resolution, scanner make and model, and objective power, and may also record focus quality.
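To make the "billions of pixels" and "pyramidal" points concrete, the following minimal sketch computes the raw size of a slide and the dimensions of each pyramid level. The slide dimensions (100,000 x 80,000 pixels), the base resolution of 0.25 microns per pixel, and the factor-of-two downsampling between levels are illustrative assumptions; real scanners and formats use their own level layouts. The function names are hypothetical.

```python
def uncompressed_bytes(width, height, channels=3):
    """Raw size of an 8-bit image with the given number of channels."""
    return width * height * channels


def pyramid_levels(width, height, base_mpp, downsample=2, min_side=1024):
    """List (level, width, height, microns_per_pixel) for an assumed pyramid.

    Each level shrinks the previous one by `downsample` until the longest
    side is at most `min_side` pixels. This layout is an assumption for
    illustration, not a property of any specific file format.
    """
    levels = []
    level, w, h, mpp = 0, width, height, base_mpp
    while True:
        levels.append((level, w, h, mpp))
        if max(w, h) <= min_side:
            break
        w = max(1, w // downsample)
        h = max(1, h // downsample)
        mpp *= downsample  # each pixel now covers a larger physical area
        level += 1
    return levels


if __name__ == "__main__":
    # Hypothetical slide: 100,000 x 80,000 pixels at 0.25 microns per pixel.
    gigabytes = uncompressed_bytes(100_000, 80_000) / 1e9
    print(f"uncompressed RGB size: {gigabytes:.0f} GB")
    for level, w, h, mpp in pyramid_levels(100_000, 80_000, 0.25):
        print(f"level {level}: {w} x {h} px at {mpp} um/px")
```

Under these assumptions the full-resolution layer alone is about 24 GB uncompressed, which is why tiled, compressed storage with lower-resolution levels is the norm.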

Because models cannot ingest a full slide at once, datasets are consumed through patch-based pipelines that crop tiles from the pyramid at a chosen magnification level. Slides span the two dominant staining families: routine hematoxylin and eosin (H&E) for morphology, and immunohistochemistry (IHC) stains that highlight specific proteins for biomarker assessment. Clinically valuable pathology datasets carry expert annotations from board-certified pathologists: region masks delineating tumor versus benign tissue, nuclei and cell segmentation, mitosis detection, tumor grading, Gleason grading and grade-group assignment for prostate cancer, and quantitative biomarker scoring such as the Ki-67 proliferation index, the PD-L1 tumor proportion score (TPS) and combined positive score (CPS), and tumor-infiltrating lymphocyte density.
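A patch-based pipeline of the kind described above can be sketched as two steps: lay a grid of tile coordinates over the slide, then keep only tiles that overlap tissue according to a low-resolution mask. This sketch uses only the Python standard library and a hand-written mask; in practice the mask would be derived from a slide thumbnail and the tiles read with a WSI library. All function names, the mask layout, and the 50% tissue threshold are assumptions for illustration.

```python
def tile_grid(width, height, tile_size, stride=None):
    """Top-left (x, y) coordinates of all full tiles inside the slide."""
    stride = stride or tile_size
    return [
        (x, y)
        for y in range(0, height - tile_size + 1, stride)
        for x in range(0, width - tile_size + 1, stride)
    ]


def tissue_fraction(mask, x, y, tile_size, scale):
    """Fraction of mask cells under a tile that are marked as tissue.

    `mask` is a 2D list of 0/1 values at thumbnail resolution, and `scale`
    is the number of full-resolution pixels covered by one mask cell.
    """
    x0, y0 = x // scale, y // scale
    x1 = max(x0 + 1, (x + tile_size) // scale)
    y1 = max(y0 + 1, (y + tile_size) // scale)
    cells = [
        mask[row][col]
        for row in range(y0, min(y1, len(mask)))
        for col in range(x0, min(x1, len(mask[0])))
    ]
    return sum(cells) / len(cells) if cells else 0.0


def select_tiles(mask, width, height, tile_size, scale, min_tissue=0.5):
    """Keep tile coordinates whose tissue fraction meets the threshold."""
    return [
        (x, y)
        for x, y in tile_grid(width, height, tile_size)
        if tissue_fraction(mask, x, y, tile_size, scale) >= min_tissue
    ]


if __name__ == "__main__":
    # Hypothetical 768 x 512 slide region; one mask cell covers 256 x 256 px.
    mask = [
        [0, 1, 0],
        [0, 0, 0],
    ]
    print(select_tiles(mask, 768, 512, tile_size=256, scale=256))
```

Filtering on a coarse mask before reading pixels avoids decoding the large background areas that make up much of a typical slide.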

Label schemas range from slide-level diagnoses to dense pixel-level masks, and the strongest cohorts document inter-observer agreement and re-adjudicate discordant cases. High-quality datasets also address the color variation that arises from different stains, reagents, and scanners across labs, applying stain normalization and documenting scanner diversity so that models generalize beyond a single institution. They include quality control for out-of-focus regions, tissue folds, pen marks, and other artifacts, and they exclude unusable slides on the basis of quality scores. Rigorous de-identification removes protected health information (PHI) from slide-label barcodes, burned-in label images, and case identifiers while preserving diagnostic tissue detail.
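One common way to document inter-observer agreement between two pathologists is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is one possible implementation for categorical slide-level labels, plus a helper that lists the discordant cases to send for re-adjudication. The label values and function names are illustrative assumptions; for ordinal labels such as grades, a weighted kappa is often preferred and is not shown here.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty case list")
    n = len(rater_a)
    # Observed agreement: share of cases where both raters chose the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def discordant_cases(rater_a, rater_b):
    """Indices of cases where the two raters disagree."""
    return [i for i, (a, b) in enumerate(zip(rater_a, rater_b)) if a != b]


if __name__ == "__main__":
    # Hypothetical labels from two pathologists on four slides.
    a = ["tumor", "tumor", "benign", "benign"]
    b = ["tumor", "benign", "benign", "benign"]
    print(cohens_kappa(a, b))
    print(discordant_cases(a, b))
```

In this toy example the raters agree on three of four slides, but half of that agreement is expected by chance, so kappa is lower than the raw 75% agreement would suggest.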

On GetDATA, researchers and medtech companies post requests specifying stain, organ and tumor type, scanner and magnification, annotation type (region masks, nuclei segmentation, or grades and scores), label taxonomy, class balance, and minimum slide counts. Verified hospitals and labs fulfill these requests with compliant, quality-scored whole-slide imaging data. Established benchmarks and registries such as CAMELYON for lymph-node metastasis detection, PANDA for prostate Gleason grading, TCGA for multi-cancer slides with molecular linkage, and TUPAC for tumor proliferation help align labels across institutions and make datasets interoperable for reproducible, regulatory-grade model development. Browse the open pathology whole-slide imaging requests below, or explore related oncology and imaging categories.

Open Pathology requests

No open Pathology requests right now. Browse all open requests.

Related categories