Oncology Datasets — Multimodal Cancer Cohort Data

Oncology datasets link the data streams generated along a cancer patient's course of care into multimodal cohorts. They are among the most valuable resources for clinical AI because cancer diagnosis, staging, treatment selection, and prognosis each depend on evidence drawn from several modalities at once. A well-constructed oncology cohort pairs, per patient: cross-sectional radiology (CT, MRI, and PET-CT for staging and response assessment); digital pathology whole-slide images (WSI) of biopsy and resection specimens; genomic and molecular markers (mutation panels, gene-expression profiles, copy-number variation, microsatellite instability, and PD-L1 and other biomarker status); laboratory values such as tumor markers and blood counts; and structured clinical data covering histology, grade, treatment regimens, and demographics. Imaging is delivered as DICOM series with acquisition metadata, pathology as multi-resolution WSI in formats such as SVS, NDPI, or DICOM-WSI with scanner and stain metadata, and molecular and clinical data as standardized tabular records. Each record carries a de-identified linkage key so that modalities can be aligned to the same patient and timepoint.
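As a rough illustration of how a de-identified linkage key aligns modalities, the sketch below groups per-modality records by patient and timepoint and keeps only the cases that have every required modality. The field names (patient_key, timepoint, modality, uri), the sample values, and both function names are assumptions made for this example, not a GetDATA schema.

```python
from collections import defaultdict


def link_modalities(records):
    """Group per-modality records by (patient_key, timepoint).

    Each record is a dict with the hypothetical fields
    'patient_key', 'timepoint', 'modality', and 'uri'.
    """
    cohort = defaultdict(dict)
    for rec in records:
        key = (rec["patient_key"], rec["timepoint"])
        cohort[key][rec["modality"]] = rec["uri"]
    return dict(cohort)


def complete_cases(cohort, required):
    """Keep only patient-timepoints that have every required modality."""
    required = set(required)
    return {key: mods for key, mods in cohort.items() if required.issubset(mods)}


# Illustrative records only; keys and paths are made up for the example.
records = [
    {"patient_key": "P001", "timepoint": "baseline", "modality": "ct", "uri": "ct/P001_t0"},
    {"patient_key": "P001", "timepoint": "baseline", "modality": "wsi", "uri": "wsi/P001_t0.svs"},
    {"patient_key": "P001", "timepoint": "baseline", "modality": "genomics", "uri": "omics/P001_t0.tsv"},
    {"patient_key": "P002", "timepoint": "baseline", "modality": "ct", "uri": "ct/P002_t0"},
]

cohort = link_modalities(records)
paired = complete_cases(cohort, {"ct", "wsi", "genomics"})
print(sorted(paired))  # only P001 has all three modalities at baseline
```

Keying on the pair of patient and timepoint, rather than on the patient alone, keeps a baseline scan from being matched to a post-treatment biopsy.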

Clinically meaningful oncology datasets are built on established staging and response standards: AJCC and UICC TNM staging, and RECIST 1.1 for measuring tumor response over time. They also include expert tumor segmentation and contouring (voxel-level lesion masks on imaging and region annotations on pathology slides) that localize disease rather than relying on study-level labels alone. The most powerful cohorts are longitudinal: they follow patients across treatment lines and record outcome endpoints including overall survival, progression-free survival, treatment response category, and recurrence, which makes them suitable for prognostic and predictive modeling rather than only single-timepoint classification. High-quality cohorts balance cancer types and stages; span multiple institutions, scanners, and pathology platforms so that models generalize; and document biomarker provenance and label confidence. They are rigorously de-identified across every modality, which means stripping PHI from DICOM headers, defacing volumes where required, removing slide-level and burned-in identifiers, and mitigating genomic re-identification risk, all while preserving the linkage keys that keep paired modalities aligned.
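To make the treatment response category concrete, here is a minimal sketch of the RECIST 1.1 target-lesion thresholds: partial response is a decrease of at least 30% in the sum of diameters relative to baseline, and progressive disease is an increase of at least 20% relative to the smallest sum on study (the nadir) that is also at least 5 mm in absolute terms. The function name and signature are hypothetical. The sketch is deliberately simplified: it treats a sum of zero as complete response and does not model lymph-node short-axis rules or non-target lesions.

```python
def recist_target_response(baseline_mm, nadir_mm, current_mm, new_lesions=False):
    """Classify target-lesion response from sums of diameters in millimetres.

    Simplified sketch of the RECIST 1.1 thresholds; lymph-node short-axis
    criteria and non-target lesion assessment are not modeled.
    """
    if baseline_mm <= 0:
        raise ValueError("baseline sum of diameters must be positive")
    if new_lesions:
        return "PD"  # any new lesion counts as progression
    if current_mm == 0:
        return "CR"  # simplification: all target lesions have disappeared

    # Progression is judged against the nadir (smallest sum on study).
    increase = current_mm - nadir_mm
    if increase >= 5 and increase * 100 >= 20 * nadir_mm:
        return "PD"

    # Partial response is judged against the baseline sum.
    if (baseline_mm - current_mm) * 100 >= 30 * baseline_mm:
        return "PR"
    return "SD"


print(recist_target_response(100, 100, 65))  # 35% below baseline
print(recist_target_response(100, 60, 75))   # 25% and 15 mm above nadir
print(recist_target_response(100, 100, 90))  # neither threshold met
```

Because progression is measured from the nadir while partial response is measured from baseline, a cohort needs every intermediate measurement, not only the first and last, for the response category to be reproducible.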

On GetDATA, researchers and medtech companies post requests specifying cancer type and site, required modalities, staging and response standards, annotation type, biomarker and molecular fields, follow-up duration, outcome endpoints, and minimum case counts. Verified hospitals and labs fulfill these requests with compliant, quality-scored data. Reference frameworks include tumor registries and large data-sharing efforts such as The Cancer Genome Atlas (TCGA) and The Cancer Imaging Archive (TCIA), whose paired imaging, pathology, and genomic resources have become de facto standards for multimodal oncology research and benchmarking. Browse the open oncology data requests below, or explore the related pathology, imaging, and clinical text categories.

Open Oncology requests

No open Oncology requests right now. Browse all open requests.
