GitHub - WGLab/PhenoSnap: Rapid extraction of phenotypes and medical entities into Phenopackets from natural texts

Clinical phenotype extraction (local, HPO-based) and VCF / ClinVar pathogenic lookup

This repository provides two utilities:

Phenotype extraction: Extracts clinical phenotype mentions from free text and maps them to Human Phenotype Ontology (HPO) terms using only local resources (no cloud-based LLMs or APIs). Output can be a TSV file or a PhenoPacket JSON file (conforming to PhenoPacket Schema v2.0) with matched phenotypes, age of onset, family history, and medication information.
VCF / ClinVar pathogenic lookup: Takes a VCF file (GRCh37 or GRCh38 coordinates), queries the ClinVar database via NCBI E-utilities, and outputs only variants annotated as Pathogenic or Likely pathogenic.

1. Install dependencies

From the project directory:

pip install -r requirements.txt

Then install a local spaCy English model (only needs to be done once per environment):

python -m spacy download en_core_web_sm

Note: The script uses spaCy for advanced NLP capabilities including dependency parsing for accurate negation detection. If you encounter issues downloading the spaCy model, see troubleshooting below.
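
Before running the extractor, it can help to confirm that both spaCy and the English model are importable from the active environment. The snippet below is a minimal sketch using only the standard library; the helper name `is_installed` is hypothetical and not part of this repository.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(package) is not None


# Models fetched with `python -m spacy download` are installed as regular
# Python packages, so they can be probed like any other module.
for name in ("spacy", "en_core_web_sm"):
    status = "found" if is_installed(name) else "MISSING"
    print(f"{name}: {status}")
```

If either line reports MISSING, rerun the corresponding install command from this section in the same environment.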

2. Download the HPO ontology (.obo)

You need to download the HPO OBO file. Here are several methods:

Method 1: Direct download via browser

Visit: https://github.com/obophenotype/human-phenotype-ontology/releases/latest
Look for the latest release and download hp.obo (or hp-base.obo)
Alternatively, direct download link: https://github.com/obophenotype/human-phenotype-ontology/raw/master/hp.obo
Save the file to your project directory or a convenient location (e.g., C:\data\hp.obo or ./hp.obo)

Method 2: Download using PowerShell (Windows)

# Download to current directory
Invoke-WebRequest -Uri "https://github.com/obophenotype/human-phenotype-ontology/raw/master/hp.obo" -OutFile "hp.obo"

# Or download to a specific location
Invoke-WebRequest -Uri "https://github.com/obophenotype/human-phenotype-ontology/raw/master/hp.obo" -OutFile "C:\data\hp.obo"

Method 3: Download using curl (Windows/Linux/macOS)

# Download to current directory
curl -L -o hp.obo "https://github.com/obophenotype/human-phenotype-ontology/raw/master/hp.obo"

# Or download to a specific location (Windows path shown; on Linux/macOS use a path such as /data/hp.obo)
curl -L -o "C:\data\hp.obo" "https://github.com/obophenotype/human-phenotype-ontology/raw/master/hp.obo"

Method 4: Download using Python script (easiest)

# Use the included helper script
python download_hpo.py --output hp.obo

# Or specify a custom location
python download_hpo.py --output C:\data\hp.obo

Method 5: Download using Python code

import urllib.request
urllib.request.urlretrieve(
    "https://github.com/obophenotype/human-phenotype-ontology/raw/master/hp.obo",
    "hp.obo"
)
print("Downloaded hp.obo successfully!")

Note: The file is typically around 50-100 MB in size, so the download may take a minute or two depending on your internet connection.
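
A truncated download or an HTML error page saved as hp.obo is easy to miss. The following sketch counts `[Term]` stanzas and checks for `HP:` identifiers in a single pass; the function name `summarize_obo` is hypothetical and the check is only a heuristic, not a full OBO parser.

```python
from typing import Iterable, Tuple


def summarize_obo(lines: Iterable[str]) -> Tuple[int, bool]:
    """Return (number of [Term] stanzas, whether any HP: id was seen)."""
    term_count = 0
    has_hp_id = False
    for raw in lines:
        line = raw.strip()
        if line == "[Term]":
            term_count += 1
        elif line.startswith("id: HP:"):
            has_hp_id = True
    return term_count, has_hp_id


# Tiny in-memory sample in OBO layout, used here instead of the real file.
SAMPLE = """format-version: 1.2

[Term]
id: HP:0000001
name: All

[Term]
id: HP:0004322
name: Short stature
"""
print(summarize_obo(SAMPLE.splitlines()))  # (2, True)
```

To check a real download, pass an open file handle, for example `summarize_obo(open("hp.obo", encoding="utf-8"))`. A result of `(0, False)` suggests the file is not an OBO ontology.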

3. Run the extractor

You can either pass a paragraph directly via --text or use an input file via --input-file.

Output formats:

--format tsv (default): Tab-separated values file
--format json: PhenoPacket JSON file (includes age of onset, family history, medications)

Using --text with TSV output (Windows Command Prompt, where ^ continues a line):

python extract_phenotypes.py ^
  --text "The patient has short stature and developmental delay." ^
  --hpo-obo path\to\hp.obo ^
  --output phenotypes.tsv ^
  --format tsv

Using --input-file with JSON output:

python extract_phenotypes.py ^
  --input-file example_input.txt ^
  --hpo-obo path\to\hp.obo ^
  --output phenotypes.json ^
  --format json

On Unix-like systems (Linux/macOS), the same commands look like:

python extract_phenotypes.py \
  --text "The patient has short stature and developmental delay." \
  --hpo-obo path/to/hp.obo \
  --output phenotypes.tsv \
  --format tsv
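
To avoid maintaining separate Windows and Unix command lines, the extractor can also be launched from Python. This is a sketch that only assembles the flags documented above; the helper `build_extract_command` is hypothetical, and the rule that exactly one of `--text` or `--input-file` is passed is an assumption based on the description in this section.

```python
import sys
from typing import List, Optional


def build_extract_command(
    hpo_obo: str,
    output: str,
    text: Optional[str] = None,
    input_file: Optional[str] = None,
    fmt: str = "tsv",
) -> List[str]:
    """Build an argument list for extract_phenotypes.py from the documented flags."""
    # Assumption: the script expects exactly one input source.
    if (text is None) == (input_file is None):
        raise ValueError("pass exactly one of text or input_file")
    if fmt not in ("tsv", "json"):
        raise ValueError("fmt must be 'tsv' or 'json'")
    source = ["--text", text] if text is not None else ["--input-file", input_file]
    return [
        sys.executable, "extract_phenotypes.py",
        *source,
        "--hpo-obo", hpo_obo,
        "--output", output,
        "--format", fmt,
    ]


command = build_extract_command(
    "hp.obo", "phenotypes.tsv",
    text="The patient has short stature and developmental delay.",
)
print(command[1:])
```

Passing the resulting list to `subprocess.run(command, check=True)` runs the script without any shell-specific line continuation or quoting.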

4. Output formats

TSV Format (default)

The TSV output file is tab-separated and starts with a header row containing these columns:

phrase: the surface text of the phenotype mention in the paragraph
hpo_id: the matched HPO identifier (e.g. HP:0004322)
hpo_label: the primary HPO label for that ID
start_char: character offset (0-based) where the mention starts in the input text
end_char: character offset (0-based, exclusive) where the mention ends in the input text
onset: ISO 8601 duration format for age of onset (e.g., P14M for 14 months), if detected
excluded: boolean indicating if the phenotype is negated/excluded (e.g., "No history of seizures" → True)

Example phenotypes.tsv:

phrase	hpo_id	hpo_label	start_char	end_char	onset	excluded
short stature	HP:0004322	Short stature	16	29	P14M	False
developmental delay	HP:0001263	Global developmental delay	34	53	P14M	False
seizures	HP:0001250	Seizure	123	131		True
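
Downstream code usually needs the offsets as integers and the excluded flag as a boolean. The sketch below reads the TSV layout described above with the standard `csv` module; the function name `read_phenotypes` is hypothetical, and treating an empty onset column as "not detected" is an assumption drawn from the example rows.

```python
import csv
import io
from typing import Dict, Iterable, List


def read_phenotypes(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Parse extractor TSV rows, converting offsets and the excluded flag."""
    rows = []
    for row in csv.DictReader(lines, delimiter="\t"):
        rows.append({
            "phrase": row["phrase"],
            "hpo_id": row["hpo_id"],
            "hpo_label": row["hpo_label"],
            "start_char": int(row["start_char"]),
            "end_char": int(row["end_char"]),
            "onset": row["onset"] or None,  # empty column: onset not detected
            "excluded": row["excluded"].strip().lower() == "true",
        })
    return rows


SAMPLE = (
    "phrase\thpo_id\thpo_label\tstart_char\tend_char\tonset\texcluded\n"
    "short stature\tHP:0004322\tShort stature\t16\t29\tP14M\tFalse\n"
    "seizures\tHP:0001250\tSeizure\t123\t131\t\tTrue\n"
)
rows = read_phenotypes(io.StringIO(SAMPLE))
observed = [r["hpo_id"] for r in rows if not r["excluded"]]
print(observed)  # ['HP:0004322']
```

For a file on disk, pass `open("phenotypes.tsv", newline="", encoding="utf-8")` in place of the `io.StringIO` object.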

PhenoPacket JSON Format

The JSON output conforms to PhenoPacket Schema v2.0 and includes:

phenotypicFeatures: Array of phenotypic features with HPO terms
- Each feature includes HPO ID, label, description, and optional onset (age of onset)
medicalActions: Array of treatments/medications if detected in text
- Includes drug name and route of administration
subject: Patient age information if age of onset is detected
metaData: Metadata including creation timestamp, schema version, and ontology resources
metaData.notes: Family history mentions if detected

Example phenotypes.json structure:

{
  "id": "phenopacket_20260220_123456",
  "phenotypicFeatures": [
    {
      "type": {
        "id": "HP:0004322",
        "label": "Short stature"
      },
      "description": "short stature",
      "excluded": false,
      "onset": {
        "age": {
          "iso8601duration": "P14M"
        }
      }
    },
    {
      "type": {
        "id": "HP:0001250",
        "label": "Seizure"
      },
      "description": "seizures",
      "excluded": true
    }
  ],
  "medicalActions": [
    {
      "treatment": {
        "agent": {
          "id": "DRUG:ASPIRIN",
          "label": "Aspirin"
        },
        "routeOfAdministration": {
          "id": "NCIT:ORAL",
          "label": "Oral"
        }
      }
    }
  ],
  "subject": {
    "timeAtEncounter": {
      "age": {
        "iso8601duration": "P14M"
      }
    }
  },
  "metaData": {
    "created": "2026-02-20T12:34:56Z",
    "createdBy": "phenotype-extractor",
    "phenopacketSchemaVersion": "2.0",
    "resources": [...],
    "notes": ["Family history: maternal - diabetes"]
  }
}
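
Because negated findings stay in the file, consumers typically separate observed from excluded features. The following sketch does that for the structure shown above; `split_features` is a hypothetical helper, and the sample is a reduced phenopacket rather than real tool output. Note that the `[...]` placeholder in the example above is an elision and would not parse as JSON.

```python
import json
from typing import Dict, List, Tuple


def split_features(phenopacket: Dict) -> Tuple[List[str], List[str]]:
    """Return (observed, excluded) HPO ids from a phenopacket dictionary."""
    observed: List[str] = []
    excluded: List[str] = []
    for feature in phenopacket.get("phenotypicFeatures", []):
        hpo_id = feature["type"]["id"]
        # A missing "excluded" key is treated as observed.
        if feature.get("excluded", False):
            excluded.append(hpo_id)
        else:
            observed.append(hpo_id)
    return observed, excluded


SAMPLE = """
{
  "phenotypicFeatures": [
    {"type": {"id": "HP:0004322", "label": "Short stature"}, "excluded": false},
    {"type": {"id": "HP:0001250", "label": "Seizure"}, "excluded": true}
  ]
}
"""
observed, excluded = split_features(json.loads(SAMPLE))
print(observed, excluded)  # ['HP:0004322'] ['HP:0001250']
```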

Extracted Information:

Age of onset: Automatically extracted from phrases like "14-month-old", "at age 2 years"
Family history: Detected from phrases like "family history of...", "maternal...", "paternal..."
Medications: Detected from phrases like "on Aspirin", "taking medication", "oral Metformin"
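
To illustrate how an age phrase can become an ISO 8601 duration such as P14M, here is a small regular-expression sketch. It is not the extractor's actual implementation; the pattern, the unit table, and the function name `age_to_iso8601` are assumptions made for this example.

```python
import re
from typing import Optional

_UNIT_CODES = {"day": "D", "week": "W", "month": "M", "year": "Y"}

# A number, optional spaces or hyphens, then a time unit (singular or plural).
_AGE_PATTERN = re.compile(
    r"\b(\d+)[\s-]*(day|week|month|year)s?\b",
    re.IGNORECASE,
)


def age_to_iso8601(text: str) -> Optional[str]:
    """Return the first age-like expression in `text` as an ISO 8601 duration."""
    match = _AGE_PATTERN.search(text)
    if match is None:
        return None
    number, unit = match.groups()
    return f"P{int(number)}{_UNIT_CODES[unit.lower()]}"


print(age_to_iso8601("A 14-month-old boy"))         # P14M
print(age_to_iso8601("first noticed at age 2 years"))  # P2Y
```

A pattern this simple cannot tell an age from any other duration ("3 days of fever" would also match), which is one reason pattern-based onset extraction needs review.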

5. Notes and limitations

Local only: the script uses spaCy NLP and a local HPO OBO file; no calls to remote LLMs or web APIs are made during extraction.
NLP-based matching: extraction uses spaCy's PhraseMatcher to match phrases in the text against HPO labels and synonyms.
Negation detection: uses spaCy's dependency parse to detect negated phenotypes (e.g., "No history of seizures", "absence of", "denies"). Negated phenotypes are still written to the output but are marked with excluded: true, consistent with the PhenoPacket standard.
Performance: building the phrase matcher can take some time on first run, since it processes all HPO terms and synonyms. The script prioritizes longer phrases to avoid partial matches.
Age of onset extraction: Uses pattern matching to detect common age expressions (e.g., "14-month-old", "at age 2 years"). May not capture all variations.
Family history extraction: Basic pattern matching for common family history phrases. More complex family structures may require manual annotation.
Medication extraction: Detects common medication patterns but may miss less common drug names or misspelled medications.
PhenoPacket compliance: The JSON output conforms to PhenoPacket Schema v2.0 and can be validated using PhenoPacket validation tools.
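
The idea of prioritizing longer phrases can be shown without spaCy. The sketch below is a plain-Python stand-in, not the script's own code: given candidate matches as character spans, it keeps the longest ones and drops any shorter span that overlaps them. The `Span` layout and the name `keep_longest` are assumptions for illustration.

```python
from typing import List, Tuple

Span = Tuple[int, int, str]  # (start_char, end_char exclusive, matched phrase)


def keep_longest(spans: List[Span]) -> List[Span]:
    """Drop spans that overlap a longer span; ties keep the leftmost span."""
    chosen: List[Span] = []
    # Longest first, then leftmost, so longer phrases claim their text first.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        start, end, _ = span
        if all(end <= c_start or start >= c_end for c_start, c_end, _ in chosen):
            chosen.append(span)
    return sorted(chosen, key=lambda s: s[0])


# Candidate matches in the text "severe short stature with seizures".
candidates = [
    (7, 20, "short stature"),
    (0, 20, "severe short stature"),
    (26, 34, "seizures"),
]
print(keep_longest(candidates))
# [(0, 20, 'severe short stature'), (26, 34, 'seizures')]
```

Without this step, the same stretch of text could be reported under both a specific term and a more general one contained in it.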

6. VCF / ClinVar pathogenic variant lookup

The script vcf_clinvar_pathogenic.py reads a VCF file (GRCh37 or GRCh38), converts each variant to SPDI format, queries the ClinVar database via NCBI E-utilities, and writes only variants that are classified as Pathogenic or Likely pathogenic to a TSV file.
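
The VCF-to-SPDI step can be sketched as follows. This is an illustration under stated assumptions, not the script's code: the accession table is deliberately a small subset, the function name `vcf_to_spdi` is hypothetical, and symbolic alleles such as `<DEL>` are not handled. The key detail is that VCF positions are 1-based while SPDI positions are 0-based.

```python
from typing import Dict, List

# Illustrative subset of RefSeq chromosome accessions; a complete table
# would list every chromosome for both assemblies.
REFSEQ: Dict[str, Dict[str, str]] = {
    "GRCh38": {"1": "NC_000001.11", "7": "NC_000007.14",
               "17": "NC_000017.11", "X": "NC_000023.11"},
    "GRCh37": {"1": "NC_000001.10", "7": "NC_000007.13",
               "17": "NC_000017.10", "X": "NC_000023.10"},
}


def vcf_to_spdi(chrom: str, pos: int, ref: str, alt: str,
                assembly: str = "GRCh38") -> List[str]:
    """Convert one VCF record to SPDI strings, one per alternate allele."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    accession = REFSEQ[assembly].get(name.upper())
    if accession is None:
        return []  # unrecognized chromosome: nothing to query
    # Multi-allelic ALT fields are comma-separated; "." and "*" carry no allele.
    alleles = [a for a in alt.split(",") if a not in (".", "*")]
    return [f"{accession}:{pos - 1}:{ref}:{allele}" for allele in alleles]


print(vcf_to_spdi("chr17", 100, "G", "A,T"))
# ['NC_000017.11:99:G:A', 'NC_000017.11:99:G:T']
```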

Dependencies: The script uses the requests library (included in requirements.txt). No HPO or spaCy model is required for this script.

Run the script

python vcf_clinvar_pathogenic.py --vcf input.vcf --out hits.tsv --assembly GRCh38

Arguments:

--vcf (required): Path to the input VCF file.
--out (required): Path to the output TSV file.
--assembly: Reference assembly: GRCh38 (default) or GRCh37.

Example (Windows):

python vcf_clinvar_pathogenic.py --vcf variants.vcf --out clinvar_hits.tsv --assembly GRCh38

Example (Linux/macOS):

python vcf_clinvar_pathogenic.py --vcf variants.vcf --out clinvar_hits.tsv --assembly GRCh37

Output format

The output is a tab-separated file with a header row containing these columns:

Column	Description
CHROM	Chromosome (from VCF)
POS	Position (1-based)
REF	Reference allele
ALT	Alternate allele
ClinVarID	ClinVar variation ID
Significance	Clinical significance (e.g. Pathogenic, Likely pathogenic)
ReviewStatus	ClinVar review status

Only variants that have at least one ClinVar record with Pathogenic or Likely pathogenic significance are included.
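
The inclusion rule can be expressed as a small predicate. This is a sketch, not the script's logic: the function name `is_pathogenic` is hypothetical, and splitting combined labels on `/` or `;` is an assumption about how significance strings may be written.

```python
import re


def is_pathogenic(significance: str) -> bool:
    """True if any part of the label is Pathogenic or Likely pathogenic."""
    parts = [p.strip().lower() for p in re.split(r"[/;]", significance)]
    return any(p in ("pathogenic", "likely pathogenic") for p in parts)


for label in ("Pathogenic", "Pathogenic/Likely pathogenic",
              "Likely benign", "Uncertain significance"):
    print(f"{label}: {is_pathogenic(label)}")
```

Exact matching on each part, rather than a substring test, keeps labels that merely contain the word "pathogenic" (for example, a conflicting-classification label) out of the results.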

Notes and limitations (VCF / ClinVar script)

Network required: The script queries NCBI E-utilities (ClinVar) over the internet.
Rate limiting: Requests are throttled to about 3 per second to comply with NCBI usage guidelines; processing large VCFs can take time.
SPDI-based lookup: Variants are converted to SPDI (Sequence, Position, Deletion, Insertion) using RefSeq accessions for the chosen assembly. Unrecognized chromosomes are skipped.
Multi-allelic sites: Each alternate allele is queried separately.
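
Throttling to roughly 3 requests per second amounts to enforcing a minimum gap between consecutive calls. The class below is one way to sketch that; it is not taken from the script, and the `Throttle` name and its injectable `clock` and `sleep` parameters are assumptions that make the behavior easy to test.

```python
import time
from typing import Callable, Optional


class Throttle:
    """Space successive calls at least 1 / rate seconds apart."""

    def __init__(self, rate: float = 3.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> None:
        self.min_interval = 1.0 / rate
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the next call is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self.min_interval - (now - self._last))
            if delay > 0:
                self._sleep(delay)
        # Record when this call is released, including any time slept.
        self._last = now + delay
        return delay


throttle = Throttle(rate=3.0)
throttle.wait()  # first call returns immediately
```

Calling `throttle.wait()` immediately before each HTTP request keeps the request rate at or below the configured limit. At 3 requests per second, a VCF with 9,000 alternate alleles needs at least 50 minutes of wall-clock time.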
