lbcb-sci/get_human_virus_refseq

Human Virus RefSeq Coding Sequences

A curated dataset of paired DNA and protein sequences for all annotated protein-coding genes across NCBI RefSeq reference genomes of viruses that infect Homo sapiens.

Each record links a CDS nucleotide sequence to its translated protein, with 5' and 3' UTR context extracted where the full genomic RNA sequence is available.


Dataset at a Glance

| Metric | Value |
| --- | --- |
| Source | NCBI RefSeq |
| Taxon | Viruses (taxid 10239) |
| Host scope | Homo sapiens |
| Coverage | Reference genomes, one per virus species |
| Records | 14,528 CDS–protein pairs |
| Unique virus species | 869 |
| Output file | final/viruses_dna_protein.parquet (zstd, ~9 MB) |
| Validation | 10/10 checks pass |

Note: The final Parquet file and intermediate downloads are excluded from this repository. See Reproducing the Dataset to rebuild.


Schema

Each row is one protein-coding CDS. Primary key: protein_accession.

| Column | Type | Description |
| --- | --- | --- |
| genome_accession | string | NCBI assembly accession (GCF_...) |
| virus_name | string | Virus scientific name |
| taxon_id | int | NCBI taxon ID |
| genome_type | string | Genome class (e.g. ssRNA(+), dsDNA); currently "unknown" for all records, see Known Limitations |
| host | string | Host organism (always "Homo sapiens") |
| transcript_accession | string | RefSeq transcript or genomic accession (NM_/NC_/etc.) |
| protein_accession | string | RefSeq protein accession (NP_/YP_/XP_/WP_); unique key |
| gene_name | string | Standard gene symbol |
| protein_name | string | Product name from RefSeq annotation |
| mrna_sequence | string | Full mRNA / genomic RNA nucleotide sequence (null if unavailable) |
| utr5_sequence | string | 5' UTR sequence (empty string if not annotated) |
| cds_sequence | string | Coding sequence (ATG through stop codon, inclusive) |
| utr3_sequence | string | 3' UTR sequence (empty string if not annotated) |
| protein_sequence | string | Translated amino acid sequence |
| cds_start_in_mrna | int | 0-based start of CDS within mRNA (null if no mRNA) |
| cds_end_in_mrna | int | Exclusive end of CDS within mRNA (null if no mRNA) |
| has_utr | bool | True if at least one UTR sequence is non-empty |

Verified invariants (all 14,528 records):

  • len(cds_sequence) % 3 == 0
  • cds_sequence begins with ATG and ends with TAA, TAG, or TGA
  • protein_accession is globally unique
  • utr5_sequence + cds_sequence + utr3_sequence == mrna_sequence (for UTR-bearing records)
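The invariants above can be re-checked on any record with a few lines of plain Python. This is a minimal sketch, not the repository's validation code: the function names are hypothetical, each record is assumed to be a dict keyed by the schema's column names, and sequences are assumed to be uppercase.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_invariants(rec: dict) -> list[str]:
    """Return the names of the per-record invariants that `rec` violates."""
    failures = []
    cds = rec["cds_sequence"]
    if len(cds) % 3 != 0:
        failures.append("cds_length_multiple_of_3")
    if not cds.startswith("ATG"):
        failures.append("atg_start")
    if cds[-3:] not in STOP_CODONS:
        failures.append("stop_codon")
    # The reconstruction invariant only applies to UTR-bearing records.
    if rec.get("has_utr"):
        rebuilt = rec["utr5_sequence"] + cds + rec["utr3_sequence"]
        if rebuilt != rec["mrna_sequence"]:
            failures.append("mrna_reconstruction")
    return failures


def duplicate_accessions(records) -> set[str]:
    """Return every protein_accession that occurs more than once."""
    seen, dupes = set(), set()
    for rec in records:
        acc = rec["protein_accession"]
        if acc in seen:
            dupes.add(acc)
        seen.add(acc)
    return dupes
```

With the Parquet file loaded as in Quick Start, the rows could be passed through these helpers via `df.iter_rows(named=True)`.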

Quick Start

import polars as pl

df = pl.read_parquet("final/viruses_dna_protein.parquet")
print(df.shape)        # (14528, 17)
print(df.dtypes)

# All CDS for SARS-CoV-2
sarscov2 = df.filter(pl.col("virus_name").str.contains("SARS-CoV-2"))

# Median CDS length and CDS count per virus
df.group_by("virus_name").agg(
    pl.col("cds_sequence").str.len_chars().median().alias("median_cds_nt"),
    pl.len().alias("n_cds"),
).sort("n_cds", descending=True)

# All CDS for a specific gene across all viruses
rdrp = df.filter(pl.col("gene_name") == "RdRp")

Reproducing the Dataset

Requirements: Python 3.10+, NCBI API key (optional but recommended), ~100 MB disk.

pip install -r requirements.txt
# 1. Enumerate human-infecting viral reference genomes from NCBI
python pipeline/phase1_enumerate_virus_genomes.py [--api-key YOUR_KEY]

# 2. Download genome packages from NCBI FTP (~38 MB)
python pipeline/phase2_download_genomes.py [--workers 4] [--api-key YOUR_KEY]

# 3. Parse feature tables + FASTAs, extract CDS/UTRs, write per-genome Parquets
python pipeline/phase3_parse_and_pair.py [--workers 8]

# 4. Deduplicate and merge into final Parquet
python pipeline/phase4_dedup_and_merge.py

# 5. Validate (10 quality checks)
python pipeline/phase5_validation.py

NCBI API key (free): raises rate limit from 3 to 10 req/sec. Register at https://www.ncbi.nlm.nih.gov/account/

The data/virus_genomes.jsonl file in this repo is the Phase 1 output — you can skip straight to Phase 2 if you use it as-is.


Repository Layout

.
├── pipeline/
│   ├── phase1_enumerate_virus_genomes.py   # NCBI Genome Assembly API enumeration
│   ├── phase2_download_genomes.py          # FTP + Entrez fallback download
│   ├── phase3_parse_and_pair.py            # CDS/UTR extraction, per-genome Parquet
│   ├── phase4_dedup_and_merge.py           # Dedup on protein_accession, final Parquet
│   └── phase5_validation.py               # 10 quality checks
├── data/
│   └── virus_genomes.jsonl                # Phase 1 output: 1,897 genome records
├── requirements.txt
└── README.md

Pipeline Design

Phase 1 — Enumerate viral reference genomes

The NCBI Genome Assembly API has no reliable host filter. Phase 1 works around this by querying a curated list of ~30 taxon IDs covering the major human-pathogenic viral families (Coronaviridae, Flaviviridae, Retroviridae, Herpesviridae, Orthomyxoviridae, Paramyxoviridae, Picornaviridae, Papillomaviridae, Poxviridae, Adenoviridae, Hepadnaviridae, Caliciviridae, Reoviridae, Filoviridae, Arenaviridae, Rhabdoviridae, and others).

For each family, the Genome Assembly API (/datasets/v2/genome/taxon/{taxid}/dataset_report) returns all RefSeq assemblies. Only assemblies with a GCF_ accession and a valid FTP path are retained. The result is 1,897 viral reference genome records.
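The request and the retention rule described above might look roughly like the following sketch. The endpoint path comes from the text; the base URL, the function names, and the flat `accession` / `ftp_path` record keys are assumptions made for illustration and may not match the API's actual response layout or the pipeline's code. What counts as a "valid" FTP path is also assumed here to mean a non-empty ftp:// or https:// URL.

```python
# Assumed base URL of the NCBI Datasets v2 REST API.
API_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def dataset_report_url(taxid: int) -> str:
    """Build the dataset_report URL for one taxon ID, e.g. 11118 (Coronaviridae)."""
    return f"{API_BASE}/genome/taxon/{taxid}/dataset_report"


def retain_refseq(records: list[dict]) -> list[dict]:
    """Keep only records with a GCF_ accession and a usable FTP path.

    `records` is assumed to be a list of already-flattened dicts with
    hypothetical keys "accession" and "ftp_path".
    """
    kept = []
    for rec in records:
        accession = rec.get("accession", "")
        ftp_path = rec.get("ftp_path", "")
        if accession.startswith("GCF_") and ftp_path.startswith(("ftp://", "https://")):
            kept.append(rec)
    return kept
```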

Phase 2 — Download

Phase 2 downloads four files per genome from the NCBI FTP site: _cds_from_genomic.fna.gz, _feature_table.txt.gz, _rna.fna.gz, and _translated_cds.faa.gz.

The RNA file (_rna.fna.gz) is treated as optional; most viral RefSeq assemblies do not include it (the CDS is annotated directly on the genomic sequence). An Entrez efetch fallback downloads a GenBank record for any NC_-only genomes lacking an FTP package. Downloads are idempotent.
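As an illustration of the file naming and the idempotency rule, the sketch below derives the four URLs from an assembly's FTP directory and decides whether a local copy still has to be fetched. It assumes the usual NCBI layout in which each file name is the directory's last path component plus the suffix; the function names and the "non-empty file means done" rule are hypothetical, not taken from phase2_download_genomes.py.

```python
from pathlib import Path

REQUIRED_SUFFIXES = (
    "_cds_from_genomic.fna.gz",
    "_feature_table.txt.gz",
    "_translated_cds.faa.gz",
)
OPTIONAL_SUFFIXES = ("_rna.fna.gz",)  # absent for most viral assemblies


def file_urls(ftp_path: str) -> dict[str, str]:
    """Map each file suffix to its full URL under the assembly directory."""
    base = ftp_path.rstrip("/")
    prefix = base.rsplit("/", 1)[-1]  # e.g. GCF_xxxxxxxxx.1_AssemblyName
    return {s: f"{base}/{prefix}{s}" for s in REQUIRED_SUFFIXES + OPTIONAL_SUFFIXES}


def needs_download(dest: Path) -> bool:
    """Idempotency rule assumed here: skip files that already exist and are non-empty."""
    return not (dest.exists() and dest.stat().st_size > 0)
```

A caller would treat a missing optional file as a normal outcome rather than an error, and only fail the genome when a required file cannot be fetched.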

Phase 3 — Parse and pair

  • The feature table maps protein_id → genomic_accession (NC_...) for viral CDS (unlike mammalian genomes where it maps to a separate mRNA transcript).
  • CDS FASTA supplies nucleotide sequences; protein FASTA supplies amino acid sequences.
  • UTR extraction: CDS is located by exact substring search within the full RNA/genomic sequence. Since most viral assemblies lack a separate RNA file, has_utr = False for the majority of records.
  • Non-CDS features (mat_peptide, misc RNA) are excluded via the [gbkey=CDS] filter.
  • Pseudogenes are excluded via the [pseudo=true] filter.
  • ~84 non-human viruses (plant viruses and bovine/rodent-specific viruses) that entered through broad family-level enumeration are filtered out by name pattern.

Phase 4 — Deduplicate and merge

  • Dedup key: protein_accession (globally unique per protein isoform).
  • Priority order when multiple assemblies share a protein: refseq_category = reference genome > reference > other, then by annotated gene count.
  • has_utr is recomputed post-merge from UTR sequence content.
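A minimal sketch of this deduplication rule, assuming each record is a dict that carries its assembly's `refseq_category` plus a hypothetical `gene_count` field, and assuming that a higher gene count wins ties (the text does not state the direction):

```python
# Lower rank is better; categories not listed here fall into "other".
CATEGORY_RANK = {"reference genome": 0, "reference": 1}
OTHER_RANK = 2


def dedup_by_protein(records: list[dict]) -> list[dict]:
    """Keep one record per protein_accession, chosen by category, then gene count."""
    best: dict[str, tuple[tuple[int, int], dict]] = {}
    for rec in records:
        key = rec["protein_accession"]
        # Negate the gene count so that "smaller tuple" means "preferred record".
        rank = (
            CATEGORY_RANK.get(rec.get("refseq_category"), OTHER_RANK),
            -rec.get("gene_count", 0),
        )
        if key not in best or rank < best[key][0]:
            best[key] = (rank, rec)
    return [rec for _, rec in best.values()]


def recompute_has_utr(rec: dict) -> dict:
    """Derive has_utr from the UTR sequence content, as done after the merge."""
    has_utr = bool(rec.get("utr5_sequence") or rec.get("utr3_sequence"))
    return {**rec, "has_utr": has_utr}
```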

Phase 5 — Validation

Ten checks: schema completeness, null audit on required fields, protein_accession uniqueness, accession format patterns (NP_/YP_/XP_/WP_ proteins; NC_/NM_/XM_ transcripts), full IUPAC nucleotide and amino acid alphabets, CDS divisibility by 3, ATG start / TAA-TAG-TGA stop codons, UTR consistency, mRNA reconstruction (UTR5 + CDS + UTR3 == mRNA), and CDS coordinate bounds.
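Three of these checks (accession format, alphabets, and CDS coordinate bounds) are easy to illustrate on a single record. The regular expressions, the alphabet sets, and the function name below are assumptions chosen for the sketch; the patterns actually used by phase5_validation.py may be stricter or looser.

```python
import re

# Assumed accession shapes: prefix, underscore, digits, optional ".version".
PROTEIN_ACC_RE = re.compile(r"^(NP|YP|XP|WP)_\d+(\.\d+)?$")
TRANSCRIPT_ACC_RE = re.compile(r"^(NC|NM|XM)_\d+(\.\d+)?$")

IUPAC_NT = set("ACGTRYSWKMBDHVN")
IUPAC_AA = set("ACDEFGHIKLMNPQRSTVWYBZXJUO*")


def format_failures(rec: dict) -> list[str]:
    """Return the names of the format-level checks that `rec` fails."""
    failures = []
    if not PROTEIN_ACC_RE.match(rec["protein_accession"]):
        failures.append("protein_accession_format")
    if not TRANSCRIPT_ACC_RE.match(rec["transcript_accession"]):
        failures.append("transcript_accession_format")

    mrna = rec.get("mrna_sequence")
    if not set(rec["cds_sequence"] + (mrna or "")) <= IUPAC_NT:
        failures.append("nucleotide_alphabet")
    if not set(rec["protein_sequence"]) <= IUPAC_AA:
        failures.append("amino_acid_alphabet")

    # Coordinates are null when no mRNA is available, so only check them otherwise.
    if mrna is not None:
        start, end = rec["cds_start_in_mrna"], rec["cds_end_in_mrna"]
        in_bounds = 0 <= start < end <= len(mrna)
        if not (in_bounds and end - start == len(rec["cds_sequence"])):
            failures.append("cds_coordinate_bounds")
    return failures
```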


Known Limitations

  • No UTRs for most records. NCBI viral RefSeq assemblies do not provide separate RNA FASTA files; _rna.fna.gz is absent for nearly all genomes. UTR extraction requires downloading the full genomic FASTA separately.
  • genome_type is "unknown" for all records. The NCBI Virus API does not reliably return molecule type in its current dataset report format.
  • Broad family enumeration. Querying at the family level captures some animal-specific viruses within the same families as human pathogens (e.g. equine and avian herpesviruses). These are not filtered out beyond obvious name-based exclusions.
  • Polyprotein mature peptides excluded. Only primary ORF CDS are included; mat_peptide features (e.g. individual SARS-CoV-2 nsps) are intentionally excluded.
  • Frameshifted CDS excluded. CDS produced by programmed ribosomal frameshifting (e.g. ORF1ab) may fail the standard QC checks; those that fail are dropped.

Source & License

Database: NCBI RefSeq
Taxon: Viruses (taxid 10239)
Host scope: Homo sapiens (taxid 9606)

NCBI data is in the public domain. See NCBI disclaimer.


Reference

Adapted from the Primate RefSeq CDS pipeline, which built an equivalent dataset for 35 primate species (2.3 M records, 2.1 GB).
