Clinical Text Datasets — Clinical NLP & Notes Data
Clinical text datasets capture the free-text narrative of care: the unstructured documentation where clinicians record reasoning, findings, and plans that rarely fit into discrete fields. They are foundational for clinical natural-language processing and for large language models in medicine. These corpora span the full range of note types: discharge summaries, progress notes, history and physical (H&P) documents, nursing and consult notes, operative reports, and the radiology and pathology reports that pair narrative impressions with imaging and tissue findings. Documents are typically delivered as plain text, JSON, or XML, sometimes embedded in HL7 messages or in FHIR DocumentReference and DiagnosticReport resources. The strongest datasets retain document metadata such as note type, specialty, encounter context, author role, and timestamps while removing anything that could identify the patient.
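As a rough illustration of the FHIR delivery format mentioned above, the sketch below pulls the note text and a few metadata fields out of a DocumentReference resource represented as a Python dict. It assumes the note is carried inline as a base64-encoded text/plain attachment; real feeds may instead reference the note by URL or use other content types. The function name extract_note and the sample resource are hypothetical, made up for this example.

```python
import base64


def extract_note(doc_ref: dict) -> dict:
    """Return note text plus basic metadata from a FHIR DocumentReference dict.

    Assumption: the note body is inline in content[].attachment.data as
    base64-encoded plain text. Attachments of other types are skipped.
    """
    if doc_ref.get("resourceType") != "DocumentReference":
        raise ValueError("expected a DocumentReference resource")

    parts = []
    for item in doc_ref.get("content", []):
        attachment = item.get("attachment", {})
        is_plain_text = attachment.get("contentType", "").startswith("text/plain")
        if is_plain_text and "data" in attachment:
            parts.append(base64.b64decode(attachment["data"]).decode("utf-8"))

    return {
        "note_type": doc_ref.get("type", {}).get("text"),
        "date": doc_ref.get("date"),
        "text": "\n".join(parts),
    }


# Hypothetical, already de-identified sample resource.
sample_doc_ref = {
    "resourceType": "DocumentReference",
    "type": {"text": "Discharge summary"},
    "date": "2020-01-01T00:00:00Z",
    "content": [
        {
            "attachment": {
                "contentType": "text/plain",
                "data": base64.b64encode(b"Patient denies chest pain.").decode("ascii"),
            }
        }
    ],
}

note = extract_note(sample_doc_ref)
```

Keeping note type and timestamp alongside the decoded text is what lets downstream tasks filter by document kind without re-parsing the source resource.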
The most useful clinical text datasets carry expert annotations aligned to well-defined NLP tasks: named-entity recognition of problems, medications, laboratory tests, procedures, and anatomy; assertion, negation, and temporality classification that distinguishes present findings from absent, historical, hypothetical, or family-attributed ones; relation extraction linking drugs to dosages and adverse events; and document-level tasks such as summarization and phenotyping. The most valuable corpora map spans to standardized terminologies, including ICD-10 diagnosis and procedure codes, SNOMED CT concepts, RxNorm for medications, and LOINC for laboratory observations, so that labels are interoperable across institutions. Many adopt established annotation schemas from the i2b2 and n2c2 shared tasks. High-quality cohorts are diverse in both demographics and contributing sites, and they document inter-annotator agreement and label provenance. They are rigorously de-identified, either under HIPAA Safe Harbor, which removes all 18 categories of identifiers, or under Expert Determination, and they are quality-scored for annotation completeness and surrogate-replacement fidelity.
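To make the annotation concepts above concrete, here is a minimal sketch of a span-level annotation record and of Cohen's kappa, one common way to report agreement between two annotators. The Span fields, the label names, and the placeholder terminology code are assumptions for illustration; they are not the i2b2 or n2c2 schema and not a GetDATA format.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Span:
    """One annotated mention (hypothetical field names)."""
    start: int                  # character offset, inclusive
    end: int                    # character offset, exclusive
    label: str                  # entity type, e.g. "PROBLEM"
    assertion: str              # e.g. "present", "absent", "historical"
    code: Optional[str] = None  # terminology code, if the span is mapped


def cohen_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(labels_a)
    # Fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


text = "Patient denies chest pain."
# Negated finding; the code value is a placeholder, not a real concept ID.
span = Span(start=15, end=25, label="PROBLEM", assertion="absent",
            code="SNOMEDCT:<concept-id>")

# Token-level labels from two hypothetical annotators.
annotator_a = ["O", "PROBLEM", "O", "DRUG"]
annotator_b = ["O", "PROBLEM", "O", "O"]
kappa = cohen_kappa(annotator_a, annotator_b)
```

Kappa is only one option. Span-level F1 between annotators is also widely reported for entity tasks, since token-level agreement is inflated by the many unlabelled tokens.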
On GetDATA, researchers and medtech companies post requests specifying note types, target NLP tasks, entity and label taxonomy, coding system, annotation density, language, and minimum document counts, and verified hospitals and labs fulfill them with compliant, quality-scored clinical text data. Reference corpora such as MIMIC have shown both the value of large de-identified note collections and the difficulty of generalizing across hospitals, since documentation style, templating, and abbreviation use vary widely, making multi-site curation and external validation essential rather than optional. Whether you are training clinical entity recognizers, building automated coding and phenotyping pipelines, or fine-tuning medical language models for summarization, well-annotated clinical text is the difference between a model that reflects real practice and one that overfits a single institution's conventions.
Browse the open clinical text requests below, or explore related oncology and imaging categories.